Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models
By: Lachlan McGinness, Peter Baumgartner
Potential Business Impact:
Frontier LLM reasoning ability has plateaued; recent apparent gains come from prompting strategies rather than genuinely improved models.
We present empirical methods for examining the capability of Large Language Models (LLMs) to follow Automated Theorem Prover (ATP) reasoning strategies. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems, and develop methods for assessing the accuracy of LLM responses and their correlation with the correct answer. Our results show that progress in improving LLM reasoning abilities stalled over that nine-month period. By tracking completion tokens, we show that almost all of the improvement in reasoning ability since GPT-4 was released can be attributed either to hidden system prompts or to models being trained to apply generic Chain-of-Thought prompting automatically. Among the ATP reasoning strategies tried, we found that current frontier LLMs follow the bottom-up (also known as forward-chaining) strategy best. Finally, we found only a low positive correlation between an LLM response containing correct reasoning and arriving at the correct conclusion.
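The final claim, a low positive correlation between correct reasoning and a correct conclusion, can be made concrete with a small sketch. Assuming each graded response is reduced to two binary labels (reasoning judged correct, final answer correct), the phi coefficient over the resulting 2x2 contingency table is one natural way to measure that correlation; the paper's actual grading pipeline and metric are not reproduced here, so treat this as an illustration under those assumptions only.

```python
import math

def phi_coefficient(reasoning_ok: list[bool], answer_ok: list[bool]) -> float:
    """Phi coefficient (Matthews correlation for two binary variables)
    between 'reasoning judged correct' and 'final answer correct' labels."""
    pairs = list(zip(reasoning_ok, answer_ok))
    n11 = sum(1 for r, a in pairs if r and a)          # correct reasoning, correct answer
    n10 = sum(1 for r, a in pairs if r and not a)      # correct reasoning, wrong answer
    n01 = sum(1 for r, a in pairs if not r and a)      # flawed reasoning, correct answer
    n00 = sum(1 for r, a in pairs if not r and not a)  # flawed reasoning, wrong answer
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical toy labels, not data from the paper.
reasoning = [True, True, True, True, False, False, False, False]
answers   = [True, True, True, False, True, True, False, False]
print(f"phi = {phi_coefficient(reasoning, answers):.2f}")
```

On this toy data the coefficient comes out around 0.26: correct intermediate reasoning only weakly predicts a correct final answer, matching the qualitative finding above.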