Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations
By: Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, and more
Potential Business Impact:
Finds hidden AI mistakes in new situations.
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, a pattern termed "accuracy on the line." This pattern is often taken to imply that spurious correlations (correlations that improve ID performance but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy on the line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes comprising over half of the standard OOD set, on which higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
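The abstract does not spell out how OODSelect works, but its stated goal (finding OOD subsets on which higher ID accuracy predicts lower OOD accuracy) can be illustrated with a minimal sketch. The sketch below is an assumption, not the paper's implementation: it learns soft per-example inclusion weights by gradient descent so that the Pearson correlation between models' ID accuracies and their accuracy on the weighted OOD subset is driven negative. The function name `oodselect_sketch` and all hyperparameters are hypothetical.

```python
import numpy as np

def oodselect_sketch(id_acc, ood_correct, steps=1000, lr=0.5, seed=0):
    """Hypothetical sketch: learn soft inclusion weights over OOD examples so
    that accuracy on the weighted subset correlates negatively with ID accuracy.

    id_acc:      shape (M,), ID accuracy of each of M models
    ood_correct: shape (M, N), 0/1 correctness of each model on each OOD example
    Returns a boolean mask over the N OOD examples (weight > 0.5).
    """
    rng = np.random.default_rng(seed)
    M, N = ood_correct.shape
    theta = rng.normal(scale=0.01, size=N)      # logits of soft inclusion weights
    a = id_acc - id_acc.mean()                  # centered ID accuracies (fixed)
    na = np.linalg.norm(a) + 1e-9
    w = np.full(N, 0.5)
    for _ in range(steps):
        w = 1.0 / (1.0 + np.exp(-theta))        # inclusion weights in (0, 1)
        W = w.sum() + 1e-9
        s = ood_correct @ w / W                 # weighted subset accuracy per model
        sc = s - s.mean()
        ns = np.linalg.norm(sc) + 1e-9
        r = float(a @ sc) / (na * ns)           # Pearson corr(ID acc, subset OOD acc)
        g_s = a / (na * ns) - r * sc / ns**2    # d r / d s (gradient of correlation)
        g_w = (ood_correct - s[:, None]).T @ g_s / W   # chain rule through s(w)
        theta = np.clip(theta - lr * g_w * w * (1.0 - w), -30.0, 30.0)
    return w > 0.5
```

On synthetic data where one group of OOD examples is solved only by low-ID-accuracy models, such a procedure concentrates its weights on that group, reproducing the paper's qualitative claim that a subset can invert the aggregate ID-OOD trend.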
Similar Papers
Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
Machine Learning (CS)
Fixes AI that cheats by using bad shortcuts.
Bias as a Virtue: Rethinking Generalization under Distribution Shifts
Machine Learning (CS)
Makes computer learning work better on new data.
Latent space analysis and generalization to out-of-distribution data
Machine Learning (Stat)
Finds when computers are shown wrong information.