Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models
By: Blanca Calvo Figueras, Rodrigo Agerri
Potential Business Impact:
Helps computers ask smart questions to check ideas.
The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose underlying assumptions and challenge the validity of argumentative reasoning structures. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This paper presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale dataset including ~5K manually annotated questions. We also investigate automatic evaluation methods and propose reference-based techniques as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data and code plus a public leaderboard are provided to encourage further research, not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.
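To make the abstract's claim about reference-based evaluation concrete, here is a minimal sketch of how generated critical questions could be scored against the dataset's manually annotated reference questions. This is not the paper's implementation: the embedding model (`all-MiniLM-L6-v2`), the cosine-similarity matching rule, and the 0.6 usefulness threshold are all illustrative assumptions.

```python
# Minimal sketch of a reference-based evaluation for CQs-Gen (illustrative,
# not the paper's exact method): each generated critical question is matched
# against the annotated reference questions for the same argumentative text
# and scored by its best cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def score_generated_questions(generated, references, threshold=0.6):
    """Return per-question best similarity and the fraction judged 'useful'."""
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)       # shape: [n_generated, n_references]
    best = sims.max(dim=1).values.tolist()      # best reference match per question
    useful = sum(s >= threshold for s in best) / len(best)
    return best, useful


# Toy usage with hypothetical questions
generated = [
    "Is the expert cited actually an authority on this topic?",
    "Could the observed trend be explained by another cause?",
]
references = [
    "How credible is the expert making this claim?",
    "Are there alternative explanations for the evidence presented?",
]
scores, useful_ratio = score_generated_questions(generated, references)
print(scores, useful_ratio)
```

A threshold-free variant could instead report the mean best-match similarity, but a cutoff makes it easier to compare against binary human usefulness judgments.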
Similar Papers
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
Computation and Language
Helps computers ask smart questions to make you think.
DEEPQUESTION: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Computation and Language
Tests AI's ability to think deeply, not just memorize.
Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models
Computation and Language
Tests computers on new science discoveries.