From Tea Leaves to System Maps: A Survey and Framework on Context-aware Machine Learning Monitoring
By: Joran Leest, Claudia Raibulet, Patricia Lago, and more
Potential Business Impact:
Helps AI understand why it's making mistakes.
Machine learning (ML) models in production fail when their broader systems -- from data pipelines to deployment environments -- deviate from training assumptions, not merely due to statistical anomalies in input data. Despite extensive work on data drift, data validation, and out-of-distribution detection, ML monitoring research remains largely model-centric while neglecting contextual information: auxiliary signals about the system around the model (external factors, data pipelines, downstream applications). Incorporating this context turns statistical anomalies into actionable alerts and structured root-cause analysis. Drawing on a systematic review of 94 primary studies, we identify three dimensions of contextual information for ML monitoring: the system element concerned (natural environment or technical infrastructure); the aspect of that element (runtime states, structural relationships, prescriptive properties); and the representation used (formal constructs or informal formats). This forms the Contextual System-Aspect-Representation (C-SAR) framework, a descriptive model synthesizing our findings. We identify 20 recurring triplets across these dimensions and map them to the monitoring activities they support. This study provides a holistic perspective on ML monitoring: from interpreting "tea leaves" (i.e., isolated data and performance statistics) to constructing and managing "system maps" (i.e., end-to-end views that connect data, models, and operating context).
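The three C-SAR dimensions described above can be read as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names, the `describe` helper, and the example item are all hypothetical, and the enum members cover only the coarse values named in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class SystemElement(Enum):
    """Which part of the surrounding system the context concerns."""
    NATURAL_ENVIRONMENT = "natural environment"
    TECHNICAL_INFRASTRUCTURE = "technical infrastructure"


class Aspect(Enum):
    """Which aspect of that system element is described."""
    RUNTIME_STATE = "runtime states"
    STRUCTURAL_RELATIONSHIP = "structural relationships"
    PRESCRIPTIVE_PROPERTY = "prescriptive properties"


class Representation(Enum):
    """How the contextual information is represented."""
    FORMAL = "formal constructs"
    INFORMAL = "informal formats"


@dataclass(frozen=True)
class CSARTriplet:
    """One (system, aspect, representation) combination.

    Hypothetical type for illustration; frozen so triplets can be
    used as dictionary keys or set members.
    """
    system: SystemElement
    aspect: Aspect
    representation: Representation

    def describe(self) -> str:
        # Human-readable label built from the three dimension values.
        return " / ".join(
            (self.system.value, self.aspect.value, self.representation.value)
        )


# Hypothetical example: a machine-checkable constraint on an upstream
# data pipeline, classified here (as an assumption) as a prescriptive
# property of the technical infrastructure in a formal representation.
example = CSARTriplet(
    system=SystemElement.TECHNICAL_INFRASTRUCTURE,
    aspect=Aspect.PRESCRIPTIVE_PROPERTY,
    representation=Representation.FORMAL,
)
```

Tagging each piece of contextual information with such a triplet would let a monitoring tool group context by dimension, for example to look up which monitoring activities a given kind of context could support. The mapping itself is defined in the paper and is not reproduced here.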
Similar Papers
Monitoring Machine Learning Systems: A Multivocal Literature Review
Software Engineering
Keeps computer smarts working right all the time.
Tracing Distribution Shifts with Causal System Maps
Software Engineering
Finds why computer learning makes mistakes.
Anomaly Detection and Early Warning Mechanism for Intelligent Monitoring Systems in Multi-Cloud Environments Based on LLM
Machine Learning (CS)
Finds computer problems before they happen.