Score: 2

I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models

Published: October 20, 2025 | arXiv ID: 2510.17496v1

By: Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, and more

BigTech Affiliations: IBM

Potential Business Impact:

Tests if smart computer programs can think like people.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range and by introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.

Country of Origin
🇨🇭 🇺🇸 Switzerland, United States

Page Count
13 pages

Category
Computer Science:
Machine Learning (CS)