CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Published: April 29, 2025 | arXiv ID: 2504.20673v1

By: Wenjing Yin, Tianze Sun, Yijiong Yu, and more

Potential Business Impact:

Offers a broader way to measure how well LLMs understand, generate, modify, and review code, going beyond single-task coding benchmarks.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a single task and lacking a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.
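As an illustration only, the sketch below shows how an evaluation harness organized around the paper's four dimensions might look. The task fields, dimension labels, and scoring hook are hypothetical placeholders, not the actual CoCo-Bench data format or API.

```python
# Hypothetical harness for a four-dimension code benchmark.
# Dimension names mirror the paper; everything else (Task fields,
# score_fn, model interface) is an illustrative assumption.
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = ["code_understanding", "code_generation",
              "code_modification", "code_review"]

@dataclass
class Task:
    dimension: str   # one of DIMENSIONS
    language: str    # e.g. "python", "java"
    difficulty: str  # e.g. "easy", "medium", "hard"
    prompt: str      # input shown to the model
    reference: str   # expected output or review rubric

def evaluate(model: Callable[[str], str],
             tasks: List[Task],
             score_fn: Callable[[str, Task], float]) -> Dict[str, float]:
    """Return the mean score per dimension for a model over a task list."""
    per_dim: Dict[str, List[float]] = {d: [] for d in DIMENSIONS}
    for task in tasks:
        prediction = model(task.prompt)
        per_dim[task.dimension].append(score_fn(prediction, task))
    return {d: sum(scores) / len(scores)
            for d, scores in per_dim.items() if scores}
```

Separating the dimension label from the scoring function lets a harness like this report per-dimension strengths and weaknesses rather than a single aggregate number, which is the kind of comparison the abstract describes.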

Country of Origin
🇨🇳 China

Page Count
15 pages

Category
Computer Science:
Software Engineering