Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
By: Narun Raman, Taylor Lundy, Kevin Leyton-Brown
Potential Business Impact:
Tests make smart computers seem smarter than they are.
When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, HLE) and 25 different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs' genuine reasoning capabilities.
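To make the presentation formats concrete, the following is a minimal Python sketch of how prompts under the kinds of conditions described in the abstract might be constructed. The function names, example question, and prompt wording are illustrative assumptions only, not the paper's actual evaluation harness.

# Illustrative sketch only: hypothetical prompt builders for the presentation
# conditions described in the abstract (free-text, standard MCQA,
# chain-of-thought before the options are shown, and "none of the above").
# None of these names or templates come from the paper itself.

QUESTION = "Which planet has the largest moon in the Solar System?"
OPTIONS = ["Earth", "Jupiter", "Saturn", "Neptune"]


def free_text(question: str) -> str:
    # No options are offered: the model must produce the answer itself.
    return f"{question}\nAnswer:"


def mcqa(question: str, options: list[str]) -> str:
    # Standard MCQA: options are shown and the model picks a letter.
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\nAnswer with a single letter:"


def cot_then_options(question: str, options: list[str]) -> list[str]:
    # Two-stage variant: the model reasons and answers in free text first,
    # then sees the options and maps its answer onto one of them, so the
    # chain of thought cannot exploit information contained in the choices.
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    stage_1 = f"{question}\nThink step by step, then state your answer."
    stage_2 = f"Now select the option that matches your answer:\n{lettered}"
    return [stage_1, stage_2]


def none_of_the_above(question: str, options: list[str], correct_index: int) -> str:
    # The correct answer is replaced by "None of the above", penalizing models
    # that merely eliminate implausible options rather than solving the question.
    opts = list(options)
    opts[correct_index] = "None of the above"
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(opts))
    return f"{question}\n{lettered}\nAnswer with a single letter:"

Per the abstract's findings, only formats in which reasoning happens before the options become visible (as in the two-stage variant above) keep MCQA scores aligned with free-text performance.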
Similar Papers
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
Computation and Language
Computers trust their answers more when they explain them.
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
Computation and Language
Makes AI answers more honest and fair.
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?
Computation and Language
Tests show questions change how well computers think.