
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Published: March 9, 2025 | arXiv ID: 2503.06680v2

By: Wei Li, Xin Zhang, Zhongxin Guo and more

BigTech Affiliations: Microsoft

Potential Business Impact:

Tests how well AI can add new features to computer code.

Business Areas:

Facial Recognition Data and Analytics, Software

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to possess both code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, providing a more comprehensive evaluation of LLMs' automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
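The filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `PullRequest` structure, the test-file naming convention, and the keyword match standing in for intent-based filtering are hypothetical and are not taken from the paper, which does not spell out its rules in this abstract.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request (not the paper's schema)."""
    title: str
    changed_files: list[str] = field(default_factory=list)


def is_test_file(path: str) -> bool:
    # Assumed convention: tests live under a tests/ directory or are named test_*.
    name = path.rsplit("/", 1)[-1]
    return "tests/" in path or name.startswith("test_")


def passes_rule_filter(pr: PullRequest) -> bool:
    # Assumed rule: keep a PR only if it changes both source code and unit
    # tests, so that a candidate solution can be verified by running the tests.
    has_tests = any(is_test_file(p) for p in pr.changed_files)
    has_source = any(
        p.endswith(".py") and not is_test_file(p) for p in pr.changed_files
    )
    return has_tests and has_source


def looks_like_feature(pr: PullRequest) -> bool:
    # Crude stand-in for intent-based filtering: a keyword match on the title.
    keywords = ("add", "support", "implement", "introduce")
    words = pr.title.lower().split()
    return any(keyword in words for keyword in keywords)


def select_task_instances(prs: list[PullRequest]) -> list[PullRequest]:
    # A PR becomes a task instance only if it passes both filters.
    return [pr for pr in prs if passes_rule_filter(pr) and looks_like_feature(pr)]
```

In this sketch a pull request titled "Add retry support" that touches `client.py` and `tests/test_client.py` would be kept, while a bug fix touching the same files, or a feature PR with no accompanying tests, would be dropped.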

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Artificial Intelligence

Lets computers solve hard science and math problems.

8 Apr 2025 3

88%

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Software Engineering

Tests computer code better for websites.

16 Jun 2025 2

88%

FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Machine Learning (CS)

Tests AI's ability to build physics simulations.

23 Dec 2025 1


Country of Origin

🇺🇸 🇨🇳 United States, China

Repos / Data Links

github.com github.com

Page Count

17 pages
