Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models
By: Lachlan McGinness, Peter Baumgartner
Potential Business Impact:
Frontier LLM reasoning ability has plateaued; recent apparent gains come from prompting strategies rather than genuinely improved models.
We present empirical methods for examining the capability of Large Language Models (LLMs) to follow Automated Theorem Prover (ATP) reasoning strategies. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems, and develop methods for assessing the accuracy of LLM responses and their correlation with the correct answer. Our results show that progress in improving LLM reasoning abilities stalled over that nine-month period. By tracking completion tokens, we show that almost all of the improvement in reasoning ability since GPT-4 was released can be attributed either to hidden system prompts or to models being trained to apply generic Chain-of-Thought prompting automatically. Among the ATP reasoning strategies tried, we found that current frontier LLMs follow the bottom-up (also known as forward-chaining) strategy best. Finally, we found only a low positive correlation between an LLM response containing correct reasoning and arriving at the correct conclusion.
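The final claim, a low positive correlation between correct reasoning and a correct conclusion, can be made concrete with a small sketch. Assuming each graded response is reduced to two binary labels (reasoning judged correct, final answer correct), the phi coefficient over the resulting 2x2 contingency table is one natural way to measure that correlation; the paper's actual grading pipeline and metric are not reproduced here, so treat this as an illustration under those assumptions only.

```python
import math

def phi_coefficient(reasoning_ok: list[bool], answer_ok: list[bool]) -> float:
    """Phi coefficient (Matthews correlation for two binary variables)
    between 'reasoning judged correct' and 'final answer correct' labels."""
    pairs = list(zip(reasoning_ok, answer_ok))
    n11 = sum(1 for r, a in pairs if r and a)          # correct reasoning, correct answer
    n10 = sum(1 for r, a in pairs if r and not a)      # correct reasoning, wrong answer
    n01 = sum(1 for r, a in pairs if not r and a)      # flawed reasoning, correct answer
    n00 = sum(1 for r, a in pairs if not r and not a)  # flawed reasoning, wrong answer
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Hypothetical toy labels, not data from the paper.
reasoning = [True, True, True, True, False, False, False, False]
answers   = [True, True, True, False, True, True, False, False]
print(f"phi = {phi_coefficient(reasoning, answers):.2f}")
```

On this toy data the coefficient comes out around 0.26: correct intermediate reasoning only weakly predicts a correct final answer, matching the qualitative finding above.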