I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
By: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, and more
Potential Business Impact:
Tests whether smart computer programs can reason like people.
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, widening attribute ranges, and introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
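To make the three evaluation axes concrete, here is a minimal, hypothetical Python sketch (not the authors' released code) of how a RAVEN-style generator could expose them: num_operands lengthens an arithmetic relation (operand complexity), attr_max widens the attribute range, and add_perceptual_uncertainty replaces a crisp attribute value with a probability distribution to simulate an imperfect perception front-end. All function and parameter names are illustrative assumptions.

import random

# Hypothetical sketch of the three I-RAVEN-X axes named in the abstract;
# not the authors' implementation.

def make_arithmetic_row(num_operands: int, attr_max: int) -> list[int]:
    # Operand complexity: I-RAVEN relations span three panels per row; raising
    # num_operands lengthens the relation. attr_max widens the attribute range.
    operands = [random.randrange(attr_max) for _ in range(num_operands - 1)]
    return operands + [sum(operands) % attr_max]  # last panel completes the rule

def add_perceptual_uncertainty(value: int, attr_max: int, confidence: float) -> list[float]:
    # Perceptual uncertainty: replace a crisp attribute value with a
    # distribution that puts `confidence` mass on the true value and spreads
    # the remainder uniformly over the other values.
    noise = (1.0 - confidence) / (attr_max - 1)
    return [confidence if v == value else noise for v in range(attr_max)]

row = make_arithmetic_row(num_operands=5, attr_max=10)
print(row)  # e.g. [7, 2, 9, 4, 2] -- the last value closes the sum rule mod 10
print(add_perceptual_uncertainty(row[0], attr_max=10, confidence=0.9))

A reasoner evaluated on such inputs must aggregate over the attribute distributions rather than commit to a single reading, which is the regime where the abstract reports LRMs still struggle.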
Similar Papers
Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?
Artificial Intelligence
AI models struggle with tricky picture puzzles.
Reasoning Models Reason Well, Until They Don't
Artificial Intelligence
Makes smart computers better at solving hard problems.
A Study of Rule Omission in Raven's Progressive Matrices
Artificial Intelligence
AI learns to solve puzzles, not just copy.