CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Published: April 29, 2025 | arXiv ID: 2504.20673v1

By: Wenjing Yin, Tianze Sun, Yijiong Yu, and more

Potential Business Impact:

Offers a broader way to measure how well LLMs understand, generate, modify, and review code, going beyond single-task coding benchmarks.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a single task and lacking a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.
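As an illustration only, the sketch below shows how an evaluation harness organized around the paper's four dimensions might look. The task fields, dimension labels, and scoring hook are hypothetical placeholders, not the actual CoCo-Bench data format or API.

```python
# Hypothetical harness for a four-dimension code benchmark.
# Dimension names mirror the paper; everything else (Task fields,
# score_fn, model interface) is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = ["code_understanding", "code_generation",
              "code_modification", "code_review"]

@dataclass
class Task:
    dimension: str   # one of DIMENSIONS
    language: str    # e.g. "python", "java"
    difficulty: str  # e.g. "easy", "medium", "hard"
    prompt: str      # input shown to the model
    reference: str   # expected output or review rubric

def evaluate(model: Callable[[str], str],
             tasks: List[Task],
             score_fn: Callable[[str, Task], float]) -> Dict[str, float]:
    """Return the mean score per dimension for a model over a task list."""
    per_dim: Dict[str, List[float]] = {d: [] for d in DIMENSIONS}
    for task in tasks:
        prediction = model(task.prompt)
        per_dim[task.dimension].append(score_fn(prediction, task))
    return {d: sum(scores) / len(scores)
            for d, scores in per_dim.items() if scores}
```

Separating the dimension label from the scoring function lets a harness like this report per-dimension strengths and weaknesses rather than a single aggregate number, which is the kind of comparison the abstract describes.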

Country of Origin
🇨🇳 China

Page Count
15 pages

Category
Computer Science:
Software Engineering