When a machine learning model is put into production, its quality can deteriorate quickly and without warning, which can be costly for the company. As a result, model monitoring is an important part of the ML model life cycle and MLOps.
All machine learning models need to be monitored at two levels: (1) at the resource level, ensuring the model is running correctly in the production environment, and (2) at the performance level, tracking how relevant the model remains over time.
At the resource level, the key questions to ask are: Is the system still functional? Are CPU, RAM, network bandwidth, and disk space sufficient? Is the expected rate of request processing being met?
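Two of these resource-level questions can be sketched with the Python standard library alone. The example below is a minimal illustration, not a production health check: the `RequestRateMonitor` class, the `resource_alerts` function, and the threshold values (10 percent free disk, 5 requests per second) are all assumptions made up for this sketch.

```python
import shutil
import time
from collections import deque


def disk_free_fraction(path="."):
    """Return the fraction of disk space still free at `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


class RequestRateMonitor:
    """Counts requests seen within a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now=None):
        """Register one processed request (timestamp in seconds)."""
        self._timestamps.append(time.monotonic() if now is None else now)

    def rate(self, now=None):
        """Requests per second over the most recent window."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds


def resource_alerts(free_fraction, request_rate, min_free=0.10, min_rate=5.0):
    """Return a list of alerts; an empty list means no threshold was crossed.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    alerts = []
    if free_fraction < min_free:
        alerts.append("low disk space")
    if request_rate < min_rate:
        alerts.append("request rate below target")
    return alerts
```

In practice a serving process would call `record()` once per handled request and evaluate `resource_alerts(disk_free_fraction(), monitor.rate())` on a schedule. CPU and RAM checks usually need a platform-specific tool or a third-party library, so they are left out here.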
At the performance level, the key questions to ask are: Is the model still an accurate representation of the pattern of new incoming data? Is it working as well as it did when it was designed?
How well a model performs depends on the data used to train it, and in particular on how representative that training data is of the live request data. When the data is constantly changing, a static model with no source of new data cannot keep up with the patterns that emerge and evolve.
While large deviations on single predictions can be detected, smaller but still significant deviations on datasets of scored rows, with or without ground truth, must be detected statistically. Model performance monitoring attempts to track this deterioration and, when necessary, trigger retraining of the model with more representative data.
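One way to detect such deviations statistically without ground truth is to compare the distribution of a feature (or of the model's scores) in live data against a reference sample from training time. The sketch below uses the Population Stability Index (PSI), chosen here only as an example of such a statistic; the text does not prescribe a particular test. The function name, the equal-width binning, and the `eps` smoothing value are assumptions of this sketch.

```python
import math


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature (or of model scores).

    Bins are equal-width over the reference range; live values outside
    that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps avoids log(0) when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))
```

A PSI of zero means the two binned distributions match, and larger values indicate a larger shift. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cutoff is a convention and should be tuned per feature and per use case before it triggers retraining.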
How often should models be retrained?
This is one of the most common questions teams have about monitoring and retraining. Unfortunately, there is no simple answer, because it depends on several factors, including:
- The domain: Models in fields such as cybersecurity and real-time trading must be updated regularly to keep up with constant change. Physical models, such as voice recognition, are more stable because their patterns do not change frequently. Even so, the most stable physical models must still adapt to change: what happens to a voice recognition model if the person coughs and their voice tone changes?
- The cost: Companies must weigh whether the cost of retraining is justified by the improved performance. Is a one percent improvement worth it if it takes a week to run the entire data pipeline and retrain the model?
- The model performance: In some cases, the model’s performance is hampered by a lack of training examples, so the decision to retrain is contingent on gathering enough new data.
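The cost and data-availability factors above can be combined into a simple decision rule. The sketch below is a deliberately naive illustration: the function name, its parameters, and the idea of expressing the expected gain and the retraining cost in the same currency unit are all assumptions, and a real decision would also account for the domain.

```python
def should_retrain(expected_gain, retraining_cost, new_rows, min_new_rows):
    """Hypothetical rule: retrain only when enough new data has been
    gathered and the expected benefit exceeds the cost (same unit)."""
    enough_data = new_rows >= min_new_rows
    worth_it = expected_gain > retraining_cost
    return enough_data and worth_it
```

For example, `should_retrain(expected_gain=5000, retraining_cost=2000, new_rows=10_000, min_new_rows=1_000)` returns `True`, while the same call with only 100 new rows returns `False`, regardless of the expected gain.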
Once a model is in production, it must maintain its high performance over time. However, different people, particularly the DevOps team, data scientists, and the business, define good performance differently.
The DevOps team’s concerns are familiar ones, such as:
- Is the model finishing the task on time?
- Is a reasonable amount of memory and processing time being used?
This is the traditional kind of IT performance monitoring that DevOps teams already know. In terms of resource demands, ML models are similar to traditional software.
The scalability of computing resources is still an important consideration, for example when retraining models in production. Deep learning models require more resources than much simpler models such as decision trees. Overall, however, DevOps teams’ existing expertise in monitoring and managing resources can be readily applied to ML models.
The business must take a holistic approach to monitoring, and some of its concerns may include the following:
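The two DevOps questions above, time taken and memory used, can be measured around a single prediction call with the standard library. This is a rough sketch: `profile_call` and `dummy_predict` are hypothetical names, and `tracemalloc` only tracks memory allocated by Python itself, so native allocations made inside numerical libraries may not be counted.

```python
import time
import tracemalloc


def profile_call(predict, batch):
    """Run predict(batch) once; return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = predict(batch)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def dummy_predict(batch):
    # Hypothetical stand-in for a real model's predict function.
    return [2 * x for x in batch]


if __name__ == "__main__":
    result, seconds, peak_bytes = profile_call(dummy_predict, [1, 2, 3])
    print(f"latency: {seconds:.6f}s, peak Python memory: {peak_bytes} bytes")
```

Tracing memory adds overhead, so a measurement like this fits occasional sampling better than every production request.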
- Is the model bringing value to the company?
- Do the model’s advantages outweigh the costs of developing and implementing it?
One part of this process is identifying KPIs for the original business objective. These should be monitored automatically whenever possible, but this is rarely simple. In our previous example, achieving the goal of reducing fraud to less than 0.1 percent of transactions requires establishing the ground truth. But even keeping track of this doesn’t answer the question: what is the net gain in dollars for the company?
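Once ground truth is available, checking the fraud KPI itself is straightforward. The sketch below assumes a list of booleans marking which transactions were later confirmed as fraudulent; the function names and the data shape are assumptions, and only the 0.1 percent target comes from the example in the text.

```python
TARGET_FRAUD_RATE = 0.001  # 0.1 percent of transactions, the goal from the text


def fraud_rate(confirmed_fraud_flags):
    """Fraction of transactions confirmed as fraudulent (ground truth)."""
    if not confirmed_fraud_flags:
        raise ValueError("no labelled transactions yet")
    return sum(confirmed_fraud_flags) / len(confirmed_fraud_flags)


def kpi_met(confirmed_fraud_flags, target=TARGET_FRAUD_RATE):
    """True when the observed fraud rate is below the target."""
    return fraud_rate(confirmed_fraud_flags) < target
```

A check like this tracks the KPI but, as noted above, says nothing about the net gain in dollars, which needs additional business data.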
The pressure on data scientists to demonstrate value will only increase as the amount of money spent on machine learning rises. In the absence of a “dollar-o-meter,” monitoring business KPIs effectively is the best available option. Ideally, the baseline should make it possible to isolate the value of the ML subproject from that of the overall project.
To summarize, data scientists and organizations must understand model degradation as part of MLOps and the ML model life cycle. In practice, every deployed model should have monitoring metrics and warning thresholds in place to detect significant drops in business performance as quickly as possible.