Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Published: July 23, 2025 | arXiv ID: 2507.17747v2

By: Linbo Cao, Jinman Zhao

Potential Business Impact:

Tests AI reasoning, not just memorized answers.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates--where one model is given the official answer to defend, and another constructs and defends an alternative answer--adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination--a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems at a fraction of the cost of creating new benchmarks. Overall, our framework underscores that "pretraining on the test set is no longer all you need," offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
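
The debate protocol in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: every name here (`QAItem`, `DebateResult`, `run_debate`, and the callable signatures) is hypothetical, and the turn order, default round count, and side shuffling are assumptions made to keep the judge blind to which side holds the official answer.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QAItem:
    question: str
    options: List[str]
    official_answer: str


@dataclass
class DebateResult:
    winner: str            # "defender" or "challenger"
    alternative: str       # answer the challenger chose to defend
    transcript: List[str]  # all turns, labeled by anonymous side


# Hypothetical callable shapes; in practice each would wrap a model call.
Argue = Callable[[str, str, List[str]], str]      # (question, position, transcript) -> argument
PickAlternative = Callable[[QAItem], str]         # item -> alternative answer
Judge = Callable[[str, str, str, List[str]], str]  # (question, pos_a, pos_b, transcript) -> "A" | "B"


def run_debate(
    item: QAItem,
    defender: Argue,
    challenger: Argue,
    pick_alternative: PickAlternative,
    judge: Judge,
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> DebateResult:
    rng = rng or random.Random()
    alternative = pick_alternative(item)
    if alternative == item.official_answer:
        raise ValueError("challenger must defend an answer other than the official one")

    # Shuffle which side is shown as "A" so the label does not leak the official answer.
    sides = [
        ("defender", item.official_answer, defender),
        ("challenger", alternative, challenger),
    ]
    rng.shuffle(sides)
    labeled = {"A": sides[0], "B": sides[1]}

    transcript: List[str] = []
    for _ in range(rounds):
        for label in ("A", "B"):
            _, position, argue = labeled[label]
            turn = argue(item.question, position, list(transcript))
            transcript.append(f"{label}: {turn}")

    # The judge sees only the two positions and the transcript, never the answer key.
    verdict = judge(item.question, labeled["A"][1], labeled["B"][1], transcript)
    if verdict not in labeled:
        raise ValueError("judge must return 'A' or 'B'")
    return DebateResult(labeled[verdict][0], alternative, transcript)
```

Under these assumptions, a model's debate score could be the fraction of items on which its side wins; a model that has only memorized the answer key gains nothing from the key as challenger and must still argue for it as defender.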

DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models

Computation and Language

Tests if AI can argue and judge debates.

10 Feb 2025

Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

Computation and Language

Helps AI judge debates like a person.

5 Jun 2025

Country of Origin

🇨🇦 Canada

Repos / Data Links

github.com

Page Count

22 pages