From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience
By: Zhiwei Li, Carl Kesselman, Tran Huy Nguyen and more
Potential Business Impact:
Makes machine learning experiments repeatable and transparent.
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
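To make the six artifacts more concrete, the following is a minimal sketch of how they might relate to one another as plain data records. It is an illustration only, not the authors' schema: every class field, the `provenance` method, and the example values (dataset name, revision string, file path, checksum) are hypothetical assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Term:
    """Controlled Vocabulary entry: a shared term (hypothetical fields)."""
    vocabulary: str
    name: str


@dataclass(frozen=True)
class Dataset:
    """A versioned collection of record identifiers (hypothetical fields)."""
    name: str
    version: str
    members: Tuple[str, ...]


@dataclass(frozen=True)
class Feature:
    """A named attribute of a record, typed by a vocabulary term."""
    name: str
    term: Term


@dataclass(frozen=True)
class Workflow:
    """A reference to the code that runs, pinned to a revision."""
    name: str
    code_revision: str


@dataclass(frozen=True)
class Asset:
    """A file consumed or produced by a run, identified by checksum."""
    path: str
    checksum: str


@dataclass
class Execution:
    """One run of a workflow, linking input datasets, code, and outputs."""
    workflow: Workflow
    inputs: List[Dataset]
    outputs: List[Asset] = field(default_factory=list)

    def provenance(self) -> Dict[str, object]:
        """Summarize which code and data versions produced which files."""
        return {
            "workflow": f"{self.workflow.name}@{self.workflow.code_revision}",
            "inputs": [f"{d.name}@{d.version}" for d in self.inputs],
            "outputs": [a.path for a in self.outputs],
        }


# Hypothetical example values, loosely modeled on the glaucoma use case.
diagnosis = Feature("diagnosis", Term("diagnosis_label", "glaucoma"))
images = Dataset("fundus_images", "1.0.0", ("img-001", "img-002"))
training = Workflow("train_classifier", "abc123")
run = Execution(training, [images], [Asset("model.bin", "sha256:placeholder")])
summary = run.provenance()
```

In this sketch the Execution record is what ties the other artifacts together: given a run, one can read back the exact dataset version and code revision that produced each output file, which is the kind of traceability the abstract describes.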