Evaluating the Logical Reasoning Abilities of Large Reasoning Models
By: Hanmeng Liu, Yiran Ding, Zhizhang Fu, and more
Potential Business Impact:
Tests if computers can think logically like people.
Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities, which are fundamental to human cognition and independent of domain knowledge, remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at four-choice argument-analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm in which small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard, demonstrating that fundamental reasoning bottlenecks persist across model scales and establishing LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
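The LogiEval-Hard screening paradigm described in the abstract amounts to a failure filter over the benchmark: keep the items the small screening model gets wrong. The sketch below is a minimal illustration under assumed details; the item fields, the `small_model_answer` callable (standing in for Qwen3-30B-A3B inference), and the exact-match check are hypothetical stand-ins, since the abstract does not specify the actual curation pipeline.

```python
# Minimal sketch of failure-based screening for a hard subset.
# Assumptions (not from the paper): item schema, answer format, and the
# small_model_answer callable are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str       # prompt text, e.g., an LSAT-style argument-analysis question
    choices: List[str]  # answer options
    answer: str         # gold label, e.g., "B"


def screen_hard_subset(
    items: List[Item],
    small_model_answer: Callable[[Item], str],
) -> List[Item]:
    """Keep only items the small screening model answers incorrectly.

    In the paper's paradigm, Qwen3-30B-A3B plays the screening role and its
    failures are treated as a proxy for difficulty at larger model scales.
    """
    hard = []
    for item in items:
        predicted = small_model_answer(item)  # one inference call per item
        if predicted.strip().upper() != item.answer.strip().upper():
            hard.append(item)  # small-model failure -> candidate for the hard subset
    return hard
```

In practice, a pipeline like this would likely sample each item more than once and require consistent failure before promoting it to the hard subset, to avoid mistaking sampling noise for difficulty; that refinement is an assumption on our part rather than a detail stated in the abstract.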
Similar Papers
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Machine Learning (CS)
Makes AI better at solving math problems.
A Survey on Large Language Models for Mathematical Reasoning
Artificial Intelligence
Helps computers solve math problems like a person.
Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning
Artificial Intelligence
Tests if AI can think like a person.