Once a machine learning model is put into production, its quality can deteriorate quickly and without warning, with potentially damaging consequences for the business. As a result, model monitoring is a critical part of the ML model life cycle and of MLOps.
All machine learning models need to be monitored at two levels: (1) at the resource level, ensuring the model is running correctly in the production environment, and (2) at the performance level, monitoring whether the model remains pertinent over time.
At the resource level, we ask key questions such as: Is the system alive? Are the CPU, network usage, RAM, and disk space sufficient? Are requests being processed at the expected rate?
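In practice, these resource-level checks are handled by standard infrastructure monitoring tools rather than written by hand. Purely as an illustration, a minimal health check using only the Python standard library might look like the following sketch (the thresholds and function name are hypothetical):

```python
import os
import shutil

# Hypothetical warning thresholds -- tune these to the production environment.
DISK_WARN_PCT = 90   # warn above 90% disk usage
LOAD_WARN = 8.0      # warn above a 1-minute load average of 8

def resource_checks(path="/"):
    """Return a dict of resource-level warnings; an empty dict means healthy."""
    warnings = {}
    usage = shutil.disk_usage(path)
    disk_pct = 100 * usage.used / usage.total
    if disk_pct > DISK_WARN_PCT:
        warnings["disk"] = f"{disk_pct:.0f}% used"
    if hasattr(os, "getloadavg"):  # not available on Windows
        load_1min = os.getloadavg()[0]
        if load_1min > LOAD_WARN:
            warnings["load"] = f"1-minute load average {load_1min:.1f}"
    return warnings

print(resource_checks())
```

A real deployment would expose such checks as metrics for an alerting system rather than printing them, but the questions being answered are the same.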
The key questions to ask at the performance level include: Is the model still an accurate representation of the pattern of new incoming data? Is it performing as well as it did during the design phase?
How well a model performs reflects the data used to train it, particularly how representative that training data is of the live request data. When the environment is constantly changing, a static model cannot keep up with new and evolving patterns unless it is fed a constant source of new data.
While it is possible to detect large deviations on single predictions, smaller but still significant deviations have to be detected statistically on datasets of scored rows, with or without ground truth. Model performance monitoring attempts to track this degradation, and, at an appropriate time, it will also trigger the retraining of the model with more representative data.
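One common way to detect such statistical deviations without ground truth is to compare the distribution of a feature (or of the model's scores) between the training data and live data. As a sketch, the widely used Population Stability Index (PSI) can be computed as follows; the drift thresholds in the comment are conventional rules of thumb, not universal constants:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample.
    Rule-of-thumb interpretation: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift (thresholds vary by organization)."""
    # Bin edges come from quantiles of the reference (training) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)          # feature values at training time
live_same = rng.normal(0, 1, 10_000)      # live data, same distribution
live_drifted = rng.normal(1.0, 1, 10_000) # live data after a mean shift

print(psi(train, live_same))     # small value: no drift detected
print(psi(train, live_drifted))  # large value: retraining may be warranted
```

When drift on a feature or on the score distribution crosses an agreed threshold, that signal can feed the retraining trigger described above.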
How often should models be retrained?
One of the key questions teams have regarding monitoring and retraining is: how often should models be retrained? Unfortunately, there is no easy answer, as this question depends on many factors, including:
- The domain: Models in areas like cybersecurity or real-time trading need to be updated regularly to keep up with the constant changes inherent in these fields. Physical models, like voice recognition, are generally more stable because the patterns don’t often abruptly change. However, even these more stable models need to adapt to change: what happens to a voice recognition model if the person has a cough and the tone of their voice changes?
- The cost: Organizations need to consider whether the cost of retraining is worth the improvement in performance. For example, if it takes one week to run the whole data pipeline and retrain the model, is it worth a 1% improvement?
- The model performance: In some situations, the model performance is restrained by the limited number of training examples, and thus the decision to retrain hinges on collecting enough new data.
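The cost consideration above can be framed as a simple break-even calculation. The figures below are entirely hypothetical, and the assumption that business value scales linearly with model performance is a crude simplification:

```python
# Back-of-the-envelope retraining cost/benefit check.
# All figures are hypothetical placeholders.

retrain_cost = 5_000.0          # compute + engineering time per retraining run
monthly_model_value = 40_000.0  # business value attributed to the model per month
performance_gain = 0.01         # expected relative improvement (the 1% example)
horizon_months = 6              # how long the improvement is expected to persist

# Crude assumption: value scales linearly with model performance.
expected_benefit = monthly_model_value * performance_gain * horizon_months

print(f"benefit ~ ${expected_benefit:,.0f} vs. cost ${retrain_cost:,.0f}")
print("retrain" if expected_benefit > retrain_cost else "skip for now")
```

Under these particular numbers the benefit ($2,400) does not cover the cost, illustrating why a week-long pipeline run for a 1% improvement is not automatically worthwhile.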
Once a model is in production, it must maintain its high performance over time. However, different people, particularly the DevOps team, data scientists, and the business, define good performance differently.
The concerns of the DevOps team are very familiar and include questions like:
- Is the model completing the task promptly?
- Is it using a sensible amount of memory and processing time?
This is traditional IT performance monitoring, which DevOps teams are already adept at. In this regard, ML models are similar to traditional software in terms of resource demands.
When retraining models in production, for example, the scalability of computing resources is an important consideration. Deep learning models require more resources than decision trees, which are much simpler. Overall, however, DevOps teams’ existing expertise in monitoring and managing resources can be easily applied to ML models.
The business must have a holistic outlook on monitoring, and some of its concerns might include questions like:
- Is the model delivering value to the enterprise?
- Do the benefits of the model outweigh the cost of developing and deploying it? (And how can we measure this?)
One part of this process is identifying KPIs for the original business objective. These should be monitored automatically whenever possible, but this is rarely simple. In our previous example, achieving the goal of reducing fraud to less than 0.1 percent of transactions requires establishing the ground truth. But even keeping track of this doesn’t answer the question: what is the net gain in dollars for the company?
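As a sketch of what automating such a KPI might look like for the fraud example, assuming ground truth labels eventually arrive for each transaction (the field name, data shape, and strict-threshold handling here are all hypothetical choices):

```python
# Minimal KPI check: once ground truth labels arrive, compare the observed
# fraud rate against the business target of 0.1% of transactions.

FRAUD_TARGET = 0.001  # business objective: fraud below 0.1% of transactions

def fraud_kpi(labeled_transactions):
    """labeled_transactions: iterable of dicts with a ground-truth
    'is_fraud' flag. Returns the observed rate and whether it meets the KPI."""
    txns = list(labeled_transactions)
    fraud_rate = sum(t["is_fraud"] for t in txns) / len(txns)
    return {"fraud_rate": fraud_rate, "on_target": fraud_rate < FRAUD_TARGET}

# Example batch: 2 fraudulent transactions out of 2,000 (exactly 0.1%,
# so not strictly below the target).
batch = [{"is_fraud": False}] * 1998 + [{"is_fraud": True}] * 2
print(fraud_kpi(batch))
```

Even with a check like this in place, the fraud rate alone still doesn't translate into the net gain in dollars, which is exactly the gap described above.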
This is an age-old challenge for software, and the pressure on data scientists to demonstrate value will only increase as the amount of money spent on machine learning rises. Effectively monitoring business KPIs is the best option available in the absence of a “dollar-o-meter.” The baseline selection is critical here, as it should ideally allow for differentiation of the value of the ML subproject rather than the overall project.
To sum up, it is critical that as part of MLOps and the ML model life cycle, data scientists and the organization understand model degradation. Practically, every deployed model should have monitoring metrics and corresponding warning thresholds to detect meaningful business performance drops as quickly as possible.