E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task
By: Jingyao Liu, Chen Huang, Zhizhao Guan, and more
Potential Business Impact:
Tests computer code automatically, saving time and money.
E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure quality while reducing annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation Framework (HITL-MAA). By evaluating various end-to-end software development (E2ESD) frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to solve these tasks effectively, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
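For readers unfamiliar with Behave-style BDD testing, the sketch below shows the general shape of such a test: a Gherkin scenario (normally a separate .feature file, shown here as a comment) paired with Python step implementations that Behave executes against the software under test. This is a minimal illustration only, not an E2EDev test case; the module `greet_app` and its `greet()` function are hypothetical stand-ins for generated code.

```python
# Gherkin scenario (normally features/greeting.feature):
#
#   Feature: Greeting command
#     Scenario: User requests a greeting
#       Given the application is installed
#       When the user greets "Ada"
#       Then the output contains "Hello, Ada"

# Step implementations (normally features/steps/greeting_steps.py):
from behave import given, when, then

@given('the application is installed')
def step_app_installed(context):
    import greet_app  # hypothetical module under test
    context.app = greet_app

@when('the user greets "{name}"')
def step_run_greet(context, name):
    # Behave's default parse-style matcher binds {name} to a parameter.
    context.output = context.app.greet(name)  # hypothetical API

@then('the output contains "{expected}"')
def step_check_output(context, expected):
    assert expected in context.output
```

Because the scenarios are plain-language specifications and the step code drives the program directly, a pipeline like this can verify functional requirements automatically, which is what makes the Behave framework a natural fit for grading E2ESD outputs.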
Similar Papers
Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
Software Engineering
Helps AI build better computer programs.
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
Software Engineering
Builds software faster by connecting ideas.