From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
By: Jia Li, Yuxin Su, Michael R. Lyu
Potential Business Impact:
Helps AI understand and fix large codebases.
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning (the ability to maintain logical consistency across massive, real-world, interdependent file systems) has become critical. Current benchmarks typically oscillate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that uses the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system based on dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, with integration width emerging as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering systems.
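To make the execution-driven mutation idea concrete, here is a minimal Python sketch, not the paper's actual implementation: it mutates a toy "repository", re-executes the mutated code so the runtime environment acts as the semantic oracle, and regenerates the ground-truth label for an assertion-verification item. All names (`toy_repo`, `mutate_constant`, `build_item`) are hypothetical illustrations.

```python
# Minimal sketch (assumed design, not the RepoReason implementation):
# execution-driven mutation with the environment as a semantic oracle.
import copy
import textwrap

# A toy "repository": module name -> source code.
toy_repo = {
    "pricing": textwrap.dedent("""
        TAX_RATE = 0.10

        def total(price, qty):
            return price * qty * (1 + TAX_RATE)
    """),
}

def mutate_constant(repo, module, old, new):
    """Apply a semantics-changing edit so memorized answers no longer hold."""
    mutated = copy.deepcopy(repo)
    mutated[module] = mutated[module].replace(old, new)
    return mutated

def run_as_oracle(repo, module, expr):
    """Execute the (mutated) code and return the true runtime value of expr."""
    namespace = {}
    exec(repo[module], namespace)  # the environment supplies the semantics
    return eval(expr, namespace)

def build_item(repo, module, expr, claimed_value):
    """Package an assertion-verification item with regenerated ground truth."""
    actual = run_as_oracle(repo, module, expr)
    return {
        "context": repo[module],
        "assertion": f"assert {expr} == {claimed_value}",
        "label": actual == claimed_value,  # fresh label, not memorized
    }

if __name__ == "__main__":
    mutated = mutate_constant(toy_repo, "pricing", "0.10", "0.25")
    item = build_item(mutated, "pricing", "total(100, 2)", 220.0)
    print(item["assertion"], "->", item["label"])  # False after the mutation
```

In this sketch the claimed value 220.0 would hold under the original constant but not after the mutation, so the regenerated label flips to False; a model that merely recalls the original repository would answer incorrectly, which is the anti-memorization effect the abstract describes.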
Similar Papers
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
Computation and Language
Tests if AI can build whole computer programs alone.
EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
Computation and Language
Makes AI explanations shorter and smarter.
Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
Software Engineering
Tests AI's real-world coding skills better.