Holistic Evaluation of State-of-the-Art LLMs for Code Generation
By: Le Zhang, Suresh Kothari
This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode problems across five programming languages, we assess model performance using rigorous metrics: compile-time errors, runtime errors, functional failures, and algorithmic suboptimalities. The results reveal significant performance variations, with DeepSeek-R1 and GPT-4.1 consistently outperforming the others in correctness, efficiency, and robustness. Through detailed case studies, we identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms, highlighting the critical role of prompt engineering and human oversight in improving results. Based on these findings, we provide actionable recommendations for developers and practitioners, emphasizing that successful LLM deployment depends on careful model selection, effective prompt design, and context-aware usage to ensure reliable code generation in real-world software development tasks.
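To make the evaluation metrics concrete, the sketch below shows one way a generated solution could be classified into the failure categories named in the abstract (compile-time error, runtime error, functional failure). This is an illustrative assumption, not the authors' actual harness; the function names, test format, and timeout value are hypothetical, and algorithmic suboptimality would require separate timing or complexity analysis.

```python
import subprocess
import sys
import tempfile
import os

def classify_solution(solution_code: str, test_code: str) -> str:
    """Classify a generated Python solution as 'compile_error',
    'runtime_error', 'functional_failure', 'timeout', or 'pass'."""
    # Compile-time check: does the generated source even parse?
    try:
        compile(solution_code, "<generated>", "exec")
    except SyntaxError:
        return "compile_error"

    # Run the solution together with its test harness in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        # A timeout is one observable symptom of an inefficient algorithm.
        return "timeout"
    finally:
        os.unlink(path)

    if result.returncode != 0:
        # Assertion failures from the tests indicate wrong output
        # (functional failure); any other crash is a runtime error.
        if "AssertionError" in result.stderr:
            return "functional_failure"
        return "runtime_error"
    return "pass"


# Example usage on a hypothetical LeetCode-style problem.
solution = (
    "def two_sum(nums, target):\n"
    "    seen = {}\n"
    "    for i, x in enumerate(nums):\n"
    "        if target - x in seen:\n"
    "            return [seen[target - x], i]\n"
    "        seen[x] = i\n"
)
tests = "assert two_sum([2, 7, 11, 15], 9) == [0, 1]\n"
print(classify_solution(solution, tests))  # -> 'pass'
```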