MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation
By: Haiyang Li
Potential Business Impact:
Tests if AI can write working code inside real software projects across several programming languages.
Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. However, current evaluation datasets suffer from issues such as the lack of runnable test cases, deviation from the distribution of real-world code, and coverage limited to Python. These limitations undermine the credibility of evaluation results. To address them, we introduce MRG-Bench (Multi-language Repository-level Code Generation Benchmark), a dataset that provides a more accurate evaluation of LLMs on practical repository-level code generation tasks. MRG-Bench has three main features: (1) practical data sourced from real-world code repositories that aligns with their actual distribution, (2) support for multiple programming languages, including Python, Java, and Go, and (3) project-level runnable test cases for assessing the quality of generated code. Based on MRG-Bench, we conducted extensive experiments covering large language models, long-context models, and retrieval-augmented generation (RAG) methods. The results demonstrate that current repository-level code generation techniques suffer from significant performance deficiencies. To investigate why models fail, we designed novel experiments to annotate the underlying causes of generation errors. The results show that most methods suffer from "difficulty in understanding user requirements," failing to comprehend their assigned tasks accurately. Moreover, the impact of different repository-level contexts on this issue varies markedly across programming languages, suggesting that, in practice, specialized contextual information needs to be designed for each language.
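To make the evaluation setup concrete, the sketch below shows how a repository-level task with project-level tests might be scored under a simple RAG-style pipeline. It is a minimal illustration of the kind of workflow the abstract describes, not MRG-Bench's actual API: the RepoTask record, the retrieve_context helper, and all field names are hypothetical, and the model call is left abstract.

```python
import pathlib
import subprocess
from dataclasses import dataclass


@dataclass
class RepoTask:
    """Hypothetical task record for one benchmark item (not MRG-Bench's real schema)."""
    repo_path: str           # checkout of the real-world repository
    language: str            # "python", "java", or "go"
    target_file: str         # file (relative to repo_path) where the generated code is inserted
    signature: str           # signature / requirement of the function to generate
    test_command: list[str]  # project-level command that runs this task's tests


def retrieve_context(task: RepoTask, query: str, k: int = 5) -> list[str]:
    """Toy RAG retrieval: rank repository files by naive keyword overlap.
    Real pipelines would use embeddings or sparse retrieval; this stands in for
    the repository-level context discussed in the abstract."""
    suffixes = {".py", ".java", ".go"}
    scored = []
    for path in pathlib.Path(task.repo_path).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="ignore")
            score = sum(text.count(tok) for tok in query.split())
            scored.append((score, text[:2000]))  # truncate snippets to keep the prompt small
    scored.sort(key=lambda item: item[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]


def evaluate(task: RepoTask, generate) -> bool:
    """Prompt a model with the requirement plus retrieved context, insert its code
    into the repository, and judge correctness by running the project-level tests."""
    context = "\n\n".join(retrieve_context(task, task.signature))
    prompt = (
        f"Complete the following {task.language} function.\n"
        f"Signature:\n{task.signature}\n\nRepository context:\n{context}"
    )
    completion = generate(prompt)  # `generate` is any LLM call: prompt str -> code str
    target = pathlib.Path(task.repo_path) / task.target_file
    with target.open("a") as f:
        f.write("\n" + completion + "\n")
    result = subprocess.run(task.test_command, cwd=task.repo_path)
    return result.returncode == 0  # pass/fail decided by the repository's own tests
```

The key design point mirrored here is that correctness is judged by the repository's own test suite rather than by string matching, which is what "project-level runnable test cases" enables.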
Similar Papers
Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks
Computation and Language
Tests if computers can reason over very long texts in many languages.
Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice
Software Engineering
Tests how well computers can review code changes and find problems.
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Computation and Language
Uses computers to automatically create coding tests in many languages.