In this article, I share notes on handling concept drift for machine learning models.
Concept drift occurs in an online supervised learning setting, when the relationship between the input data X and output data y is altered to the extent that a model mapping X to y can no longer do so with the same efficacy.
In online supervised learning, three types of drift can occur: (1) feature drift, i.e. a change in the distribution of X, (2) real concept drift, i.e. a change in the relation between X and y, or p(y|X), and (3) a change in the prior distribution p(y), e.g. the arrival of new classes. While both feature and prior distribution changes may be interesting to monitor, for purposes that extend beyond understanding changes in the problem space, it is real concept drift that we are chiefly concerned with. Consider the following scenario: Google decides to increase the price of their flagship Android devices by 20%, making them more appealing to certain segments and less appealing to others, who will move to lower-end versions of the brand or switch brands altogether to acquire devices within the original price range. As a result, the distribution of users signing up from specific mobile platforms may change. This would be feature drift. If, however, this change is not sufficient to cause the model to err in its ability to, for example, predict user retention or quality of experience, because despite the shifts in demographics most users remain within the same device price range and therefore have a similar initial experience, there may very well be no real concept drift.
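One compact way to relate the three types of drift is through the joint distribution of inputs and outputs. The factorisation below is standard probability rather than something taken from the cited papers: feature drift is a change in p(X), real concept drift is a change in p(y|X), and prior drift is a change in p(y).

```latex
p(X, y) = p(y \mid X)\, p(X) = p(X \mid y)\, p(y)
```

Read this way, the Android example is a case where p(X) moves while p(y|X) may stay put.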
Gama et al. (2013) attempted to provide a framework for reasoning about concept drift. In their framework, they give a generic schema for online adaptive learning, which covers offline learning strategies as well. Žliobaitė (2010), on the other hand, delineated mechanisms for handling concept drift. Here, I attempt to combine the information from these two bodies of work into one succinct formulation for the practitioner.
Types of Change
First and foremost, we categorize the types of concept drift by the type of change along a given dimension: speed (sudden or gradual) and frequency (always new or reoccurring).
Sudden: Occurs abruptly, causing the relation between X and y to become altered almost immediately.
Gradual: the drift occurs progressively. In an online supervised setting, one can observe a gradual decline in performance, e.g. a steady percentage drop in precision over the course of a few days or hours.
Always New: the new relation between X and y hardly ever repeats, or shows no regularity in its occurrence.
Reoccurring: the changes repeat themselves, though not necessarily in periodic fashion, e.g. seasonal changes impacting choice of music or large album releases in the summer causing a surge in single album streaming.
There seems to be no mutual exclusivity between speed and frequency types of change. A scenario where concept drift happens suddenly, and occurs at regular intervals is plausible, e.g. New Year’s music listening preferences.
Types of Detection
When it comes to detection, one could summarize the various methods into two distinct categories: performance based and distribution based.
Performance based detection
Here, the object of measurement would be either (a) metrics related to the model, such as accuracy, recall, and precision, or (b) properties of the data itself, e.g. feature mean.
Consider a churn prediction model that performs with precision p and recall r. Should the precision degrade, either suddenly or over time, to some value p - d where d > δ (equivalently, p - d < p - δ) for a defined threshold δ, then drift is considered to have occurred.
Note that this approach requires feedback, which in most practical scenarios can be delayed (e.g. 7th day churn takes 7 days to get feedback on the prediction) and even hard to attribute (e.g. recommendation from search).
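As a rough illustration of the threshold rule, the sketch below computes precision over a recent batch of predictions whose feedback has arrived, and flags drift when precision has fallen by more than a threshold below the baseline. The function names, the baseline of 0.80 and the threshold of 0.05 are assumptions made for the example, not values from the cited work.

```python
from typing import Sequence


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Precision for the positive class (label 1); 0.0 if nothing was predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


def precision_drift(baseline: float, recent: float, threshold: float) -> bool:
    """Flag drift when precision has dropped by more than `threshold` from the baseline."""
    return (baseline - recent) > threshold


# Hypothetical batch of churn predictions for which feedback is now available.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 1]

recent_precision = precision(y_true, y_pred)  # 3 true positives out of 6 predicted
drifted = precision_drift(baseline=0.80, recent=recent_precision, threshold=0.05)
print(recent_precision, drifted)
```

With delayed feedback, the batch would only contain predictions old enough to have an outcome, so the check lags behind the live model by at least the feedback delay.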
Distribution based detection
For these methods, the objects of measurement are distributions over two distinct windows: a reference window over past examples and a window over the most recent examples.
As an example, the distribution of a given variable X over another variable z, p(z|X), can be monitored for a currently running model and compared to a historical baseline. As in performance based methods, one can compare the difference between the distributions against a defined threshold to determine if drift has occurred. Preferably, the distributions under monitoring should involve the feedback, to determine the model’s performance in p(y|X).
Unlike performance based methods, which require feedback, distribution based methods may not, because the distributions can be based on the model input and output alone. However, they tend to be more computationally demanding, and without a direct relationship to the correct outcome y, they may not necessarily indicate real concept drift, but rather feature drift or prior drift. Informative multivariate distributions would tie the input data to the output of the model and its feedback, where the first two can be easily observed in most cases. So, if a feature drift occurs to the point that the distribution of the model’s output changes, it could be indicative of drift. Again though, this would not necessarily indicate that real concept drift has occurred.
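To make the feedback-free case concrete, here is a minimal sketch that compares the distribution of a model's predicted labels in a reference window against a recent window. I use total variation distance as the comparison, which is my own choice for simplicity and is not one of the measures named in the cited papers; the window contents and the 0.1 threshold are likewise made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation(reference: Sequence[Hashable], recent: Sequence[Hashable]) -> float:
    """Total variation distance between two empirical categorical distributions.

    0.0 means identical relative frequencies, 1.0 means no overlap at all.
    """
    ref_counts, rec_counts = Counter(reference), Counter(recent)
    categories = set(ref_counts) | set(rec_counts)
    return 0.5 * sum(
        abs(ref_counts[c] / len(reference) - rec_counts[c] / len(recent))
        for c in categories
    )


# Hypothetical predicted labels from a running model; no ground truth is needed.
reference_window = ["stay"] * 80 + ["churn"] * 20
recent_window = ["stay"] * 60 + ["churn"] * 40

distance = total_variation(reference_window, recent_window)
output_shifted = distance > 0.1  # assumed threshold
print(round(distance, 3), output_shifted)
```

A shift flagged this way says the model's output distribution moved; on its own it cannot tell real concept drift apart from feature or prior drift.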
Some specific ways of doing detection are briefly hinted at in a later section. Knowing how to detect drift is half the challenge. The other half is adapting to the change.
The strategies for adapting to concept drift combine assumptions about the expectation of drift and ways of dealing with them.
There is a dimension to adaptation strategies that revolves around the cardinality of the model. Here, we distinguish single model based methods and ensemble based methods. Another dimension is the method of adaptation, where we distinguish triggering and evolving methods.
One model operating solo.
Triggering: Detection - The model would be trained over variable window lengths whenever drift is detected. The length of the window would be a parameter defined to minimize the drift. For example, a model trained over a 3 month window, given a sudden drift, may benefit from retraining over the past 2 weeks only, provided it performs better.
This approach is called informed adaptation, in that decisions are made based on known drift events. The trigger here is the detection of drift, which prompts retraining.
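The window selection step can be sketched as follows: once drift has been detected, try a few candidate training windows and keep the one that does best on the most recent observations. The "model" here is a deliberately trivial mean predictor, and the candidate window lengths, holdout size and data are all assumptions for illustration.

```python
from statistics import mean
from typing import Sequence


def fit_mean(targets: Sequence[float]) -> float:
    """Stand-in 'model': predicts the mean of its training targets."""
    return mean(targets)


def best_window(history: Sequence[float], candidates: Sequence[int], holdout: int) -> int:
    """Return the candidate training window length with the lowest holdout error.

    The last `holdout` observations are kept for evaluation; each candidate
    window is taken from the observations immediately before them.
    """
    train, test = history[:-holdout], history[-holdout:]

    def holdout_mse(window: int) -> float:
        prediction = fit_mean(train[-window:])
        return mean((y - prediction) ** 2 for y in test)

    return min(candidates, key=holdout_mse)


# Hypothetical daily target: a sudden shift from 10 to 20 happened 21 days ago.
history = [10.0] * 69 + [20.0] * 21
chosen = best_window(history, candidates=[14, 90], holdout=7)
print(chosen)  # the two week window wins, since it only sees post-drift data
```

In practice the stand-in would be replaced by the real training routine, and the holdout would have to respect whatever feedback delay applies.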
Evolving: Forgetting - In this scenario the model can be trained over fixed or weight decaying windows continuously to prevent drift. Prior analysis can inform the right window interval (or size), weighting parameter for the input data, and training frequency. When fixed windows are used, the approach is called abrupt forgetting; when decaying windows are used, the approach is called gradual forgetting.
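The difference between the two forgetting styles comes down to how training examples are weighted by age. The sketch below assumes exponential decay with a half-life for the gradual case; the source only says "weight decaying", so the decay shape and the parameter names are my assumptions.

```python
from typing import List


def abrupt_weights(ages_in_days: List[int], window_days: int) -> List[float]:
    """Abrupt forgetting: examples inside the window count fully, older ones not at all."""
    return [1.0 if age < window_days else 0.0 for age in ages_in_days]


def gradual_weights(ages_in_days: List[int], half_life_days: float) -> List[float]:
    """Gradual forgetting: an example's weight halves every `half_life_days`."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


ages = [0, 7, 14, 28]
print(abrupt_weights(ages, window_days=14))     # [1.0, 1.0, 0.0, 0.0]
print(gradual_weights(ages, half_life_days=7))  # [1.0, 0.5, 0.25, 0.0625]
```

Either list can be passed as per-example sample weights to a learner that supports them; the window length and half-life are the parameters that prior analysis would tune.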
In both cases, we are dealing with blind adaptation, where the model is updated to avoid drift, without knowing whether there has been real concept drift or not. However, sudden drifts may occur within the chosen window, and they will not necessarily be addressed. In fact, such drifts can end up being systematically incorporated into the model. Imagine a problem where the best window interval is two weeks, but where the model is trained over a three week period. Over time, the classifier’s performance may be stable, but it may never reach the peak performance it would have if it were being trained over a two week period instead.
A fundamental trait of blind adaptation is that it is not preceded by drift detection. It is proactive and optimistic, which can lead to missed opportunities and hidden costs.
Several models, operating in unison.
Triggering: Contextual - With this method, several models are trained and a meta-learner is used to learn how to define context, i.e. how to map each input to the right model. For example, different models could be trained on different groups (slices of data). The triggering here would be knowing which model to switch to for a given input.
Evolving: Dynamic Ensemble - Different models are trained over different windows, and combined using a weighted score. The weighting can be assigned by a meta-learner. For example, one model could be trained over the past two weeks, another over the past month, and a third over the past 6 months.
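A minimal sketch of the weighted combination follows. Instead of a learned meta-learner, it uses a simple heuristic, weights proportional to the inverse of each model's recent error, which I am substituting purely for illustration. The model names, error values and scores are hypothetical.

```python
from typing import Dict


def inverse_error_weights(recent_errors: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """Normalised weights proportional to 1 / recent error (eps avoids division by zero)."""
    raw = {name: 1.0 / (err + eps) for name, err in recent_errors.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def ensemble_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the per-model scores."""
    return sum(weights[name] * scores[name] for name in scores)


# Hypothetical recent error of models trained over three different windows.
recent_errors = {"two_weeks": 0.10, "one_month": 0.20, "six_months": 0.40}
weights = inverse_error_weights(recent_errors)

# Hypothetical scores the three models give to one new example.
scores = {"two_weeks": 0.9, "one_month": 0.6, "six_months": 0.3}
combined = ensemble_score(scores, weights)
print(weights, round(combined, 4))
```

Recomputing the weights as new feedback arrives is what makes the ensemble "dynamic": the model whose window best matches the current concept gradually dominates.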
The ensemble model adaptive strategies also make assumptions about the expected type of drift. The triggering based ensemble approach, in assigning examples to a specific model, has the implicit assumption of time invariance. A particular group could very well be made up of event instances that occur on the weekend. But arriving at such segregation of models is left up to the practitioner. The evolving ensemble approach, on the other hand, may assume time variance to be the only factor, and thus not directly account for recurring drift. The choice of window lengths can very well make or break the solution's ability to counter concept drift.
Choosing a strategy
Given the implicit and explicit assumptions on the expected type of drift for each of the strategies, the following matching between type of drift and strategy can be delimited:
Sudden drift → Detection, Forgetting
Reoccurring drift → Contextual
Gradual drift → Dynamic Ensemble
So, which technique is proper? As is common in life, science and engineering, it depends. In this case, two factors can be considered: the type of expected drift and the goal of the application. To make this more concrete, I shall attempt to illustrate the choice with one example.
Suppose we have a system that recommends restaurants grouped into cuisines to new users on a food ordering platform. The training data is computed over a fixed window of six weeks, every six weeks, to capture favorites from highly engaged customers, e.g. people that order at least three times a week. A model is trained to predict restaurants to suggest based on demographic data and other sign up properties of users, e.g. whether they are registering from a mobile device or a desktop computer. Now, the choice of a fixed window here readily implies the assumption of stable taste over the window, i.e. that whatever the customers like to order doesn't change much within our six week window. This would be us using the forgetting approach.
If we believe that food taste differences over geographic areas may be too distinct for a single classifier to perform well, we could opt for a contextual approach. In this case, we could train one model per city or country. Furthermore, if we were convinced that food palates are more discriminating than geographical nuances, we could instead learn a different model for each food palate rather than for each geographical area. For example, we could train a model that, given the information we have at sign up, would tell us how likely a user would be to order from an Indian food restaurant, doing this for each type of cuisine we have available in our catalogue. Note that we could also use a combination of geographic data and cuisine to form our context. The next question is, can we assign a newly registered user to the correct palate model? Or better yet, given the predictions from different context models, can we decide which one to go for?
In some contexts this can be very easy, e.g. you're either in New York or Stockholm, so if your model is based on location, problem solved. Kinda. However, in certain cases it can be slightly more complicated, e.g. running inference over 16 different cuisine models and picking the right one. In that case, we may opt for a dynamic ensemble approach over a contextual one, having several recommendations for different taste palates, operating under the assumption that this could maximize orthogonality in the suggestions while ranking them according to the properties of the users on sign up, such as age and geographical region. Our meta-learner would learn to decide the ranking of each cuisine based on the input from our cuisine models, potentially performing better than us simply sorting them based on their probability score.
Needless to say, being optimistic and simply re-training over a fixed window can be substantially more practical to implement than either the contextual or dynamic ensemble approaches. The downside is the loss of opportunities. The key is to understand what your problem needs, because going that extra step may not really bring much value.
How to Measure
We have already noted the distinct types of drift detections. Here, we briefly describe the ways in which one can go about measuring drift.
For this purpose, we invoke terminology from Gama et al., and note two prime approaches: statistical process control inspired methods and multi distribution methods.
Statistical Process Control (SPC): Usually performed at the instance level, in an online learning setting where feedback is almost immediate. It relies on error measurements based on variance and confidence intervals (e.g. 95%) to detect whether examples are coming from a different distribution. Picture a live seismic sensor monitoring system, connected to a model that tries to predict the next value. Here, feedback would be readily available from the sensors, and the model’s error could easily be computed based on the mean squared error or a similar measure. To use this type of approach, one needs to consider the statistical distribution of the data in question. For instance, in a scenario similar to the previous one, with heat sensors and a model to detect overheating, the outcome would follow a Bernoulli distribution, for which misclassification thresholds can be well defined.
To set a threshold, one can measure the consequences at the bounds, e.g. what does 90% confidence in deviation detection imply, on both lower and upper boundaries, in terms of changes to the predictions of the model? Could it mean potentially missing a catastrophic failure? Or, back to our cuisine problem, if the age distribution changes to the limits of our threshold, by how much would the ranking, or the ability of the model to correctly predict the relevant food palate, change?
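For the Bernoulli case, an SPC-style monitor can be sketched as a control chart on the running misclassification rate: track the lowest error rate seen so far and raise a warning or a drift signal when the current rate exceeds it by a number of standard deviations. The class below is a simplified sketch in that spirit; the 2-sigma and 3-sigma levels, the 30-instance warm-up and the synthetic stream are assumptions of mine.

```python
import math


class ErrorRateMonitor:
    """Control-chart style monitor for a stream of 0/1 misclassification flags."""

    def __init__(self, warn_sigmas: float = 2.0, drift_sigmas: float = 3.0, warmup: int = 30):
        self.warn_sigmas = warn_sigmas
        self.drift_sigmas = drift_sigmas
        self.warmup = warmup
        self.n = 0
        self.errors = 0
        self.best_p = math.inf  # error rate at the best (lowest p + s) point so far
        self.best_s = math.inf  # its standard deviation

    def update(self, is_error: bool) -> str:
        """Consume one outcome and return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)  # std of a Bernoulli rate estimate
        if self.n < self.warmup:
            return "stable"
        if p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = p, s
        if p + s > self.best_p + self.drift_sigmas * self.best_s:
            return "drift"
        if p + s > self.best_p + self.warn_sigmas * self.best_s:
            return "warning"
        return "stable"


# Synthetic stream: 10% errors for 100 instances, then 50% errors.
stream = [i % 10 == 0 for i in range(100)] + [i % 2 == 0 for i in range(100, 150)]
monitor = ErrorRateMonitor()
statuses = [monitor.update(is_error) for is_error in stream]
first_drift = statuses.index("drift")
print(first_drift)
```

A caller would typically start collecting fresh training data at the first warning, then retrain and reset the monitor once drift is signalled.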
Multi distribution methods: These start off with a fixed window as the baseline, e.g. historically observed measures, and then test future windows under the null hypothesis that the distributions are equal. The windows may be of equal or different sizes, and they can be univariate or multivariate. The differences are computed using confidence bound methods and distribution similarity measures such as (1) Chernoff bounds (see Kifer et al. 2004), (2) entropy (Vorburger and Bernstein 2006) or (3) Kullback-Leibler (KL) divergence (Dasu et al. 2006; Sebastiao and Gama 2007).
The main idea is to identify changes in the data from one window to another. Setting the thresholds here also requires analysis over the distributions, factoring in the impact of change.
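As an illustration of the window comparison, the sketch below bins a numeric feature in a reference window and a recent window, then computes the KL divergence between the two histograms. The bin edges, the additive smoothing, the direction of the divergence (recent relative to reference), the 0.05 threshold and the age data are all assumptions for the example.

```python
import bisect
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float], smoothing: float = 1.0) -> List[float]:
    """Smoothed relative frequencies over half-open bins [edges[i], edges[i+1]).

    Additive smoothing keeps every bin non-zero so the KL divergence stays finite.
    Values outside the edges are ignored.
    """
    counts = [smoothing] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical ages of users at sign up, in a reference and a recent window.
edges = [18, 30, 45, 60, 100]
reference_ages = [22] * 40 + [35] * 40 + [50] * 15 + [70] * 5
recent_ages = [22] * 20 + [35] * 40 + [50] * 25 + [70] * 15

reference_hist = histogram(reference_ages, edges)
recent_hist = histogram(recent_ages, edges)
divergence = kl_divergence(recent_hist, reference_hist)
drift_suspected = divergence > 0.05  # assumed threshold
print(round(divergence, 4), drift_suspected)
```

KL divergence is asymmetric, so the direction should be fixed once and kept consistent when thresholds are calibrated on historical windows.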
On an observational note, distribution based approaches fit well with both detection based methods (identify drift and retrain) and contextual methods (to identify when new examples fall out of scope of the current groups). However, they can also be used offline to compare windows in order to select the best retraining frequency and window interval (or size) for a forgetting setting.
Very briefly, other ideas include
- Comparing accuracy and similar measures over a long period of time with recent accuracy; if they are different, drift can be considered to have occurred
- Compare statistics on sub-windows within a larger window using Hoeffding bounds (Bifet and Gavalda 2006; 2007); if any two sub-windows exhibit distinct enough means, older sub-windows used for training are dropped
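The sub-window idea in the second bullet can be sketched as follows: try each split of the current window into an older and a newer part, and drop the older part whenever the two means differ by more than a Hoeffding-style bound. This is a simplified, non-incremental variant for illustration only. The exact form of the bound, the `delta` and `min_size` parameters and the data are assumptions, and values are assumed to lie in [0, 1].

```python
import math
from typing import List, Sequence


def hoeffding_cut(n_old: int, n_new: int, delta: float) -> float:
    """Hoeffding-style threshold on the difference of two sub-window means."""
    m = 1.0 / (1.0 / n_old + 1.0 / n_new)   # effective size of the two parts
    delta_prime = delta / (n_old + n_new)    # correction for testing many splits
    return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))


def shrink_window(window: Sequence[float], delta: float = 0.05, min_size: int = 5) -> List[float]:
    """Drop the oldest data while some split shows significantly different means."""
    window = list(window)
    changed = True
    while changed and len(window) >= 2 * min_size:
        changed = False
        for split in range(min_size, len(window) - min_size + 1):
            old, new = window[:split], window[split:]
            gap = abs(sum(old) / len(old) - sum(new) / len(new))
            if gap > hoeffding_cut(len(old), len(new), delta):
                window = new  # forget the older sub-window
                changed = True
                break
    return window


# Hypothetical per-instance error signal whose mean jumps from 0.1 to 0.9.
signal = [0.1] * 100 + [0.9] * 100
kept = shrink_window(signal)
print(len(kept))
```

The quadratic scan over splits is fine for a small window like this one; streaming implementations keep compressed summaries of the window instead of rescanning it.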
In a nutshell, SPC methods work on instances while multi distribution methods work on batches of instances.
Concept drift can be hard to address, depending on the context. In an online learning setting where feedback is readily available, one can simply monitor the performance of the model(s) in question for drift. A drop in prediction accuracy, for instance, is a clear red flag of a disparity between the output of the model y and the real distribution p(y|X), assuming all else to be equal.
For a good number of cases, though, that kind of feedback is not readily available, due to delays or arduous attribution, leaving no recourse but to observe the input data and the output of the model itself. In such cases, one resorts to monitoring distributions and comparing them, usually over different windows. This monitoring can be done to trigger retraining based on detected drifts or to tune the retraining policy, covering window interval (or size) and retraining frequency. The choices of measures are slightly diverse, but they all boil down to (1) understanding the impact of drift in the measure of interest, and (2) setting sensible thresholds for it. One could postulate that efficiency would demand finding the most impactful distributions to monitor, and focus on them, rather than attempting to monitor everything, even if one could measure everything, if only for the sake of sanity of the people responsible for responding to alerts.
All sections thus far have not considered the operational aspect, where many errors can and inevitably do occur. A database downtime can cause features to become unavailable, making the application switch to a default value that can lead to observations in the distribution p(y|X) that are not real, i.e. not pertaining to the relation between X and y, but the application itself. Notably, this would be a virtual feature drift problem, which could be picked up via distribution or even statistical monitoring (e.g. mean and variance) of the variable in question.
One cannot help but draw a parallel to the world of operational systems monitoring, where the prime concern is not the drift in the mapping of an input X to an output y, but merely changes in a potentially multidimensional value. Consider a metric such as daily orders, which is a time series, with dimensions such as platform and country. General and fine-grained approaches for detecting anomalies in this type of data have been extensively studied, and there are tools that can be used to address them.
Plainly, employing operational monitoring techniques around online models is indispensable to uncovering issues similar to the one described earlier. However, a clear distinction has to be made between finding anomalies and detecting drift. Anomaly techniques can do very well at detecting subtle unexpected changes in time series. However, most drift techniques hinge on identifying, either directly or via proxy, the disparity between the input X and output y. Even when one resorts to monitoring distributions of the input data X, it is not anomalies per se that one seeks within the variable itself, but any change that can impede an effective result. Colloquially, it doesn’t matter if X changes, unless there is a strong suggestion that p(y|X) has changed as well. Thus, more attention here is placed on y and its relationship to X, as opposed to just X or just y, as one would in anomaly detection. What concept drift monitoring does, in essence, is anomaly detection on the relationship between X and y.
Finally, we conclude with a guide to reasoning about concept drift in your setting, with a set of three questions:
(1) What are the expected types of change? Informs the adaptive strategy of choice
(2) Is there a way to get feedback on the model? Informs the detection methods available
(3) What is the risk of drift? Quantifying the risk can help define thresholds, which apply to any type of detection mechanism
In the end, one can use the following checklist for setting up concept drift handling:
- Define expected type of changes
- Define feedback cycle, if it applies
- Select adaptive strategy
- Study measures and compute thresholds for detection
(1) A Survey on Concept Drift Adaptation · Gama, J. et al. (2013)
(2) Learning under Concept Drift: an Overview · Indrė Žliobaitė (2010)