Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

Published: November 26, 2025 | arXiv ID: 2511.21860v1

By: Paulo Cavalin, Cassia Sanctos, Marcelo Grave and more

Potential Business Impact:

Measures how reliably AI models answer multiple-choice questions.

Business Areas:
Test and Measurement, Data and Analytics

In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric leverages the response consistency of the LLMs, taking advantage of synthetically-generated questions with altered answer choices. Using two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
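The abstract does not give the exact BMCA, CI, or CoRA formulas, so the snippet below is only a hedged sketch of the general idea: score a model on the original questions, check whether its answers stay stable across synthetically altered answer-choice variants, and scale the raw accuracy down by that consistency. All function names, the multiplicative rebalancing, and the assumption that answers are canonicalized to option content (not letter labels, which change when choices are altered) are illustrative assumptions, not the paper's definitions.

```python
# Hedged sketch of consistency-rebalanced scoring (assumed formulation,
# not the paper's exact CoRA/BMCA/CI definitions).
from typing import List


def mcqa_accuracy(preds: List[str], gold: List[str]) -> float:
    """Standard multiple-choice accuracy on the original benchmark questions."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def consistency_index(variant_preds: List[List[str]]) -> float:
    """Assumed consistency measure: fraction of questions whose selected answer
    (canonicalized to the option content) is identical across all
    synthetically altered answer-choice variants."""
    consistent = sum(1 for answers in variant_preds if len(set(answers)) == 1)
    return consistent / len(variant_preds)


def consistency_rebalanced_accuracy(preds: List[str],
                                    gold: List[str],
                                    variant_preds: List[List[str]]) -> float:
    """Illustrative CoRA-style score: raw MCQA accuracy scaled down by the
    consistency index, so inconsistent models receive lower scores."""
    return mcqa_accuracy(preds, gold) * consistency_index(variant_preds)


# Example: the model answers both original questions correctly, but flips its
# answer on the altered variants of the second question, so its score is halved.
preds = ["Paris", "H2O"]
gold = ["Paris", "H2O"]
variant_preds = [["Paris", "Paris", "Paris"],
                 ["H2O", "CO2", "H2O"]]
print(consistency_rebalanced_accuracy(preds, gold, variant_preds))  # 0.5
```

The multiplicative adjustment is just one plausible way to make inconsistency reduce the reported score; the paper's actual CoRA computation combines BMCA and CI and may differ.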

Page Count
13 pages

Category
Computer Science:
Computation and Language