MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation
By: Haiyang Li
Potential Business Impact:
Tests if AI can write working code inside real software projects across several programming languages.
Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. However, current evaluation datasets suffer from issues such as the lack of runnable test cases, deviation from the distribution of real-world code, and coverage limited to Python. These limitations undermine the credibility of evaluation results. To address them, we introduce MRG-Bench (Multi-language Repository-level Code Generation Benchmark), a dataset that provides a more accurate evaluation of LLMs on practical repository-level code generation tasks. MRG-Bench has three main features: (1) practical data sourced from real-world code repositories that aligns with their actual distribution, (2) support for multiple programming languages, including Python, Java, and Go, and (3) project-level runnable test cases for assessing the quality of generated code. Based on MRG-Bench, we conducted extensive experiments covering large language models, long-context models, and retrieval-augmented generation (RAG) methods. The results demonstrate that current repository-level code generation techniques suffer from significant performance deficiencies. To investigate why models fail, we designed novel experiments to annotate the underlying causes of generation errors. The results show that most methods suffer from "difficulty in understanding user requirements," failing to comprehend their assigned tasks accurately. Moreover, the impact of different repository-level contexts on this issue varies markedly across programming languages, suggesting that, in practice, specialized contextual information needs to be designed for each language.
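To make the evaluation setup concrete, the sketch below shows how a repository-level task with project-level tests might be scored under a simple RAG-style pipeline. It is a minimal illustration of the kind of workflow the abstract describes, not MRG-Bench's actual API: the RepoTask record, the retrieve_context helper, and all field names are hypothetical, and the model call is left abstract.

```python
import pathlib
import subprocess
from dataclasses import dataclass


@dataclass
class RepoTask:
    """Hypothetical task record for one benchmark item (not MRG-Bench's real schema)."""
    repo_path: str           # checkout of the real-world repository
    language: str            # "python", "java", or "go"
    target_file: str         # file (relative to repo_path) where the generated code is inserted
    signature: str           # signature / requirement of the function to generate
    test_command: list[str]  # project-level command that runs this task's tests


def retrieve_context(task: RepoTask, query: str, k: int = 5) -> list[str]:
    """Toy RAG retrieval: rank repository files by naive keyword overlap.
    Real pipelines would use embeddings or sparse retrieval; this stands in for
    the repository-level context discussed in the abstract."""
    suffixes = {".py", ".java", ".go"}
    scored = []
    for path in pathlib.Path(task.repo_path).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="ignore")
            score = sum(text.count(tok) for tok in query.split())
            scored.append((score, text[:2000]))  # truncate snippets to keep the prompt small
    scored.sort(key=lambda item: item[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]


def evaluate(task: RepoTask, generate) -> bool:
    """Prompt a model with the requirement plus retrieved context, insert its code
    into the repository, and judge correctness by running the project-level tests."""
    context = "\n\n".join(retrieve_context(task, task.signature))
    prompt = (
        f"Complete the following {task.language} function.\n"
        f"Signature:\n{task.signature}\n\nRepository context:\n{context}"
    )
    completion = generate(prompt)  # `generate` is any LLM call: prompt str -> code str
    target = pathlib.Path(task.repo_path) / task.target_file
    with target.open("a") as f:
        f.write("\n" + completion + "\n")
    result = subprocess.run(task.test_command, cwd=task.repo_path)
    return result.returncode == 0  # pass/fail decided by the repository's own tests
```

The key design point mirrored here is that correctness is judged by the repository's own test suite rather than by string matching, which is what "project-level runnable test cases" enables.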
Similar Papers
Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks
Computation and Language
Tests if computers can reason over very long texts in many languages.
Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice
Software Engineering
Tests how well computers can review code changes and find problems.
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Computation and Language
Uses computers to automatically create coding tests in many languages.