Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
By: Paulo Cavalin, Cassia Sanctos, Marcelo Grave and more
Potential Business Impact:
Measures how reliably AI models answer multiple-choice questions.
In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric exploits the response consistency of LLMs, taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, the Bare-Minimum-Consistency Accuracy (BMCA) and the Consistency Index (CI), CoRA is computed by adjusting multiple-choice question answering (MCQA) scores to better reflect the consistency level of the LLM. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
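The sketch below illustrates the general idea of a consistency-adjusted MCQA score. The abstract does not give the exact formulas, so the definitions here are assumptions: BMCA is taken to require a correct answer on the original question and on every altered-choice variant, CI is taken to be the fraction of questions answered consistently across variants, and CoRA is taken to interpolate between BMCA and plain accuracy using CI. The paper's actual definitions may differ.

```python
# Hypothetical sketch of a consistency-adjusted MCQA score in the spirit of CoRA.
# Assumed definitions (not taken from the paper): BMCA requires a correct answer
# on the original question and every altered-choice variant; CI is the fraction
# of questions whose predictions agree across all variants; CoRA rescales plain
# accuracy toward BMCA by CI.

from dataclasses import dataclass


@dataclass
class QuestionResult:
    correct_original: bool          # correct on the original choice ordering
    correct_variants: list[bool]    # correctness on altered-choice variants
    consistent: bool                # same answer (mapped back) across all variants


def mcqa_accuracy(results: list[QuestionResult]) -> float:
    # Standard MCQA accuracy on the original questions only.
    return sum(r.correct_original for r in results) / len(results)


def bmca(results: list[QuestionResult]) -> float:
    # Bare-Minimum-Consistency Accuracy (assumed): correct on the original
    # question and on every altered-choice variant.
    return sum(
        r.correct_original and all(r.correct_variants) for r in results
    ) / len(results)


def consistency_index(results: list[QuestionResult]) -> float:
    # Consistency Index (assumed): fraction of questions answered consistently.
    return sum(r.consistent for r in results) / len(results)


def cora(results: list[QuestionResult]) -> float:
    # Consistency-Rebalanced Accuracy (assumed): plain accuracy pulled down
    # toward BMCA in proportion to the consistency index.
    acc = mcqa_accuracy(results)
    low = bmca(results)
    return low + consistency_index(results) * (acc - low)


if __name__ == "__main__":
    # A model that looks strong on raw accuracy but flips answers when the
    # choices are altered ends up with a CoRA below its MCQA score.
    results = [
        QuestionResult(True, [True, True], True),
        QuestionResult(True, [False, True], False),
        QuestionResult(True, [False, False], False),
        QuestionResult(False, [False, False], True),
    ]
    print(f"MCQA accuracy: {mcqa_accuracy(results):.2f}")   # 0.75
    print(f"BMCA:          {bmca(results):.2f}")             # 0.25
    print(f"CI:            {consistency_index(results):.2f}")  # 0.50
    print(f"CoRA:          {cora(results):.2f}")             # 0.50
```

In this toy example the model answers three of four original questions correctly (MCQA 0.75), but its inconsistency under altered answer choices pulls the adjusted score down to 0.50, matching the paper's claim that CoRA scales down the scores of inconsistent models.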
Similar Papers
Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach
Computation and Language
Makes AI answer questions more fairly.
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Computation and Language
Tests AI to see if it's reliable.
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
Computation and Language
Makes AI answers about health more trustworthy.