CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

Published: December 6, 2025 | arXiv ID: 2512.06248v1

By: Cheng Cheng, Jinqiu Yang

Potential Business Impact:

Tests computer code for mistakes and safety.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias through a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. CFCEval evaluates generated code across four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance. Our experiments show that CFCEval captures both the quality and security aspects of generated code more effectively, and that ELRM aligns more closely with human judgments than CodeBLEU, paving the way for future advancements in Code LLM evaluation.
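The abstract does not give the formulas behind ELRM or CodeBLEU, but the tokenization criticism it raises is easy to see with a toy, BLEU-style n-gram overlap. The sketch below is purely illustrative: the function names and the whitespace tokenizer are assumptions for demonstration, not the paper's method and not CodeBLEU's reference implementation.

```python
# Minimal sketch (not from the paper): a simplified, BLEU-style n-gram overlap
# between a reference snippet and a generated snippet. CodeBLEU additionally
# weighs language keywords and adds AST / data-flow matching; ELRM's actual
# formulation is not given in the abstract, so nothing here reflects it.
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(reference_code, generated_code, n=2):
    """Clipped n-gram precision of generated code against one reference.

    Naive whitespace tokenization is used on purpose: it illustrates the
    "imprecise tokenization" problem the abstract points to, e.g. "x+1"
    and "x + 1" share no tokens here even though they are the same code.
    """
    ref = ngrams(reference_code.split(), n)
    gen = ngrams(generated_code.split(), n)
    if not gen:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in gen.items())
    return matched / sum(gen.values())


if __name__ == "__main__":
    reference = "def add(a, b): return a + b"
    generated = "def add(a,b): return a+b"
    # Semantically identical code, yet the naive token overlap reports 0.0,
    # which is the kind of similarity-metric failure the paper targets.
    print(ngram_overlap(reference, generated))
```

The point of the toy example is only that surface-level token matching can score semantically identical programs as dissimilar; how ELRM actually measures reference-to-generation relevance is detailed in the paper itself.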

Country of Origin
🇨🇦 Canada


Page Count
10 pages

Category
Computer Science:
Software Engineering