Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction
By: Congchi Yin, Tianyi Wu, Yankai Shu, and more
Potential Business Impact:
Teaches computers to figure out hidden rules.
Existing tasks fall short in evaluating the reasoning ability of Large Language Models (LLMs) in interactive, unknown environments. This deficiency leads to isolated assessment of deductive, inductive, and abductive reasoning, neglecting the integrated reasoning process that is indispensable for human discovery of the real world. We introduce a novel evaluation paradigm, black-box interaction, to tackle this challenge. A black-box is defined by a hidden function that maps a specific set of inputs to outputs. LLMs are required to unravel the hidden function behind the black-box by interacting with it over a given number of exploration turns and reasoning over the observed input-output pairs. Leveraging this idea, we build the ORACLE benchmark, which comprises 6 types of black-box tasks and 96 black-boxes. 19 modern LLMs are benchmarked. o3 ranks first in 5 of the 6 tasks, achieving over 70% accuracy on most easy black-boxes, but it still struggles with some hard black-box tasks, where its average performance drops below 40%. Further analysis indicates a universal difficulty among LLMs: they lack the high-level planning capability to develop efficient and adaptive exploration strategies for hypothesis refinement.
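To make the paradigm concrete, here is a minimal sketch of the interaction loop, assuming a toy affine rule as the hidden function. The names `hidden_function`, `run_exploration`, and `check_hypothesis` are hypothetical illustrations, not code from the ORACLE benchmark; a real agent would choose queries adaptively rather than at random.

```python
import random

# Hypothetical hidden function standing in for one black-box;
# the agent only ever sees input-output pairs, never this definition.
def hidden_function(x: int) -> int:
    return 3 * x + 7

def run_exploration(budget: int = 10) -> list[tuple[int, int]]:
    """Query the black-box for `budget` exploration turns and collect observations."""
    observations = []
    for _ in range(budget):
        query = random.randint(-100, 100)  # an LLM agent would pick queries strategically
        observations.append((query, hidden_function(query)))
    return observations

def check_hypothesis(hypothesis, observations) -> bool:
    """Deductive check: does the proposed rule explain every observation?"""
    return all(hypothesis(x) == y for x, y in observations)

obs = run_exploration()
# Inductive step: the agent proposes a rule from the observed pairs,
# then verifies it deductively against all evidence.
proposed = lambda x: 3 * x + 7
print("hypothesis consistent:", check_hypothesis(proposed, obs))
```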
Similar Papers
Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning
Artificial Intelligence
Helps computers guess better with less information.
From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models
Artificial Intelligence
Computers change how they think based on how hard a problem is.
Reasoning Models Reason Well, Until They Don't
Artificial Intelligence
Makes smart computers better at solving hard problems.