The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks
By: Arda Uzunoglu, Tianjian Li, Daniel Khashabi
Potential Business Impact:
Makes computer tests show true skill levels.
Benchmarks shape scientific conclusions about model capabilities and steer model development. This creates a feedback loop: stronger benchmarks drive better models, and better models demand more discriminative benchmarks. Ensuring benchmark reliability is therefore essential for trustworthy evaluation and meaningful progress. In this work, we study benchmark reliability from a distributional perspective and introduce benchmark harmony, which measures how uniformly a model's performance is distributed across the subdomains of a benchmark. We posit that high harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across subdomains. Across 19 multiple-choice benchmarks and five model families, we map each benchmark onto a mean-variance plane of harmony computed across models, where high mean and low variance signal more reliable evaluation. Our analysis shows that less harmonious benchmarks can give misleading results, since overall accuracy may be disproportionately influenced by specific subdomains. For instance, ARC-Easy is overwhelmed by questions on Biological Concepts, overshadowing other critical subdomains such as Geography, Physics, Chemistry, and Environmental Science. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
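To make the idea of uniformity across subdomains concrete, the following is a minimal sketch. The abstract does not give the formula for harmony, so the code uses a stand-in proxy (normalized Shannon entropy of per-subdomain accuracy shares), not the authors' definition. The function names and the idea of summarizing a benchmark by the mean and variance of this proxy across models are illustrative assumptions modeled on the mean-variance plane described above.

```python
import math
from statistics import mean, pvariance


def uniformity_score(subdomain_accuracy):
    """Stand-in proxy for harmony (NOT the paper's formula).

    Takes a mapping {subdomain name: accuracy} and returns the normalized
    Shannon entropy of the accuracy shares, a value in [0, 1].
    1.0 means accuracy is spread evenly across subdomains.
    """
    values = list(subdomain_accuracy.values())
    total = sum(values)
    if len(values) < 2 or total == 0:
        return 0.0  # uniformity is undefined here; report 0 by convention
    shares = [v / total for v in values]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(len(values))


def benchmark_position(per_model_accuracy):
    """Place one benchmark on a mean-variance plane (hypothetical helper).

    Takes {model name: {subdomain name: accuracy}} and returns the mean and
    population variance of the proxy score across models.
    """
    scores = [uniformity_score(acc) for acc in per_model_accuracy.values()]
    return mean(scores), pvariance(scores)
```

Under this proxy, a model with equal accuracy in every subdomain scores 1.0, and a model whose accuracy is concentrated in one subdomain scores lower. A benchmark with a high mean and low variance across models would sit in the region the abstract associates with more reliable evaluation.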