Score: 3

E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task

Published: October 16, 2025 | arXiv ID: 2510.14509v2

By: Jingyao Liu, Chen Huang, Zhizhao Guan, and more

Potential Business Impact:

Tests whether AI systems can build working software directly from end-to-end user requirements.

Business Areas:
Simulation Software

The rapid advancement in large language models (LLMs) has demonstrated significant potential in End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates the capabilities of E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure its quality while reducing the annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to effectively solve these tasks, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
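To make the benchmark's structure concrete, below is a minimal sketch of how a BDD scenario and its Python step implementation fit together under the Behave framework the abstract describes. The to-do requirement, step wording, and in-memory assertions are illustrative assumptions for this listing, not material drawn from E2EDev itself.

# features/add_item.feature -- Gherkin scenario consumed by Behave (assumed example)
# Feature: Add a to-do item
#   Scenario: Adding a single item
#     Given the to-do app is running with an empty list
#     When the user adds the item "buy milk"
#     Then the list contains exactly 1 item named "buy milk"

# steps/todo_steps.py -- Python step implementations bound to the scenario above
from behave import given, when, then

@given('the to-do app is running with an empty list')
def step_app_running(context):
    # In a real E2ESD evaluation this would launch the generated software;
    # here an in-memory list stands in purely for illustration.
    context.todos = []

@when('the user adds the item "{text}"')
def step_add_item(context, text):
    context.todos.append(text)

@then('the list contains exactly 1 item named "{text}"')
def step_check_item(context, text):
    assert context.todos == [text], f"expected [{text!r}], got {context.todos!r}"

Running "behave" over such feature files and step definitions is what lets a pipeline like the one described above check, automatically, whether generated software satisfies each fine-grained user requirement.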

Country of Origin
🇨🇳 🇸🇬 China, Singapore


Page Count
52 pages

Category
Computer Science:
Software Engineering