CodeSimpleQA: Scaling Factuality in Code Large Language Models
By: Jian Yang, Wei Zhang, Yizhi Li, and more
Potential Business Impact:
Tests whether AI truly knows programming facts, not just how to write code.
Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. A critical challenge remains, however: ensuring that LLMs give factually accurate answers about programming concepts, language features, and technical implementations. Most previous code-related benchmarks focus on code execution correctness and overlook the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark for evaluating the factual accuracy of code LLMs on code-related questions. It contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. We further create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework that combines supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
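To make the evaluation setup concrete, here is a minimal sketch of how factuality grading over CodeSimpleQA-style question-answer pairs might run. The JSONL file name, its field names ("question", "answer"), and the normalized substring check below are illustrative assumptions, not the benchmark's actual schema or grading protocol; SimpleQA-style benchmarks often use a prompted LLM judge instead of string matching.

```python
# Hypothetical sketch of factuality evaluation over QA pairs.
# Schema and grading rule are assumptions, not the paper's protocol.
import json
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as factual errors."""
    return re.sub(r"\s+", " ", text.strip().lower())


def grade(prediction: str, reference: str) -> bool:
    """Count an answer correct if the reference fact appears in the
    normalized response; a real benchmark would likely use an LLM judge."""
    return normalize(reference) in normalize(prediction)


def evaluate(path: str, ask_model) -> float:
    """Run every QA pair through `ask_model` and return accuracy."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: question, answer
            correct += grade(ask_model(item["question"]), item["answer"])
            total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Stub model and hypothetical data path, for illustration only.
    acc = evaluate("codesimpleqa_en.jsonl", lambda q: "Python 2.7")
    print(f"factual accuracy: {acc:.1%}")
```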
Similar Papers
CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
Computation and Language
Helps computers answer questions in many languages.
Facts Do Care About Your Language: Assessing Answer Quality of Multilingual LLMs
Computation and Language
Makes learning tools more truthful for all languages.
The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Computation and Language
Makes AI answer facts consistently in both short and long answers.