Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks
By: Andrew D. Blevins, Ian K. Quigley
Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.
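To make the idea of an author-disjoint split concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the record format (one author label per molecule), the function name `author_disjoint_split`, and the greedy group assignment are all hypothetical, and real publication data with several authors per paper would need a grouping step first.

```python
import random
from collections import defaultdict


def author_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split molecules so that no author label occurs in both train and test.

    records: list of (molecule_id, author) tuples. This format is an
    assumption made for illustration; it is not taken from the paper.
    """
    # Group molecule ids by their (single, assumed) author label.
    by_author = defaultdict(list)
    for mol_id, author in records:
        by_author[author].append(mol_id)

    # Shuffle the author groups reproducibly.
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)

    # Greedily move whole author groups into the test set until it reaches
    # the requested size; every remaining group goes to the training set.
    target = test_fraction * len(records)
    train, test = [], []
    for author in authors:
        bucket = test if len(test) < target else train
        bucket.extend(by_author[author])
    return train, test


# Toy data: six molecules made by four hypothetical authors.
records = [
    ("m1", "A"), ("m2", "A"), ("m3", "B"),
    ("m4", "C"), ("m5", "C"), ("m6", "D"),
]
train, test = author_disjoint_split(records, test_fraction=0.3)
```

Because whole author groups are assigned together, the realized test fraction only approximates `test_fraction`. A scaffold-based split groups by molecular scaffold instead, which leaves the same author free to appear on both sides.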