
Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

Published: November 6, 2025 | arXiv ID: 2511.04064v1

By: Zhengran Zeng, Yixin Li, Rui Xie and more

Potential Business Impact:

Helps AI build better computer programs.

Business Areas:

Simulation Software

The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50% of requirements on E2EDevBench, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.
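The hybrid evaluation described in the abstract reports two complementary signals: how many test cases pass, and how many individual requirements are judged fulfilled. The sketch below is only an illustration of that idea, not the paper's implementation. The names `RequirementVerdict`, `pass_rate`, `fulfillment_rate`, and `omitted_requirements` are hypothetical, and the LLM-based verdicts are hard-coded rather than produced by a model.

```python
from dataclasses import dataclass


@dataclass
class RequirementVerdict:
    """One requirement plus a judge's verdict (hypothetical structure)."""
    requirement: str
    fulfilled: bool  # assumed to come from an LLM-based verifier; hard-coded here


def pass_rate(test_results: list[bool]) -> float:
    """Fraction of test cases that passed (functional assessment)."""
    return sum(test_results) / len(test_results) if test_results else 0.0


def fulfillment_rate(verdicts: list[RequirementVerdict]) -> float:
    """Fraction of requirements judged fulfilled (requirement verification)."""
    if not verdicts:
        return 0.0
    return sum(v.fulfilled for v in verdicts) / len(verdicts)


def omitted_requirements(verdicts: list[RequirementVerdict]) -> list[str]:
    """Requirements judged unfulfilled, listed for failure analysis."""
    return [v.requirement for v in verdicts if not v.fulfilled]


# Illustrative inputs only; these are not results from the paper.
example_tests = [True, True, False, True]
example_verdicts = [
    RequirementVerdict("parse config file", True),
    RequirementVerdict("validate user input", False),
    RequirementVerdict("export report as CSV", True),
    RequirementVerdict("log errors to file", False),
]

if __name__ == "__main__":
    print("test pass rate:", pass_rate(example_tests))
    print("requirement fulfillment rate:", fulfillment_rate(example_verdicts))
    print("unfulfilled:", omitted_requirements(example_verdicts))
```

The two rates are kept separate rather than merged into one score, since the abstract does not say how, or whether, the two signals are combined.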

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

Software Engineering

Helps AI build better computer programs.

10 Oct 2025 1

93%

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

Software Engineering

Helps computers build and fix software better.

10 Oct 2025 0

92%

Survey on Evaluation of LLM-based Agents

Artificial Intelligence

Tests how smart AI agents can act and learn.

20 Mar 2025 1


Country of Origin

🇨🇳 China

Page Count

12 pages
