Score: 2

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Published: June 7, 2025 | arXiv ID: 2506.06821v3

By: Yuhan Cao, Zian Chen, Kun Quan, and more

Potential Business Impact:

Uses AI language models to generate test cases that automatically find bugs in computer code.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) problems and propose TCGBench, a benchmark for LLM generation of test case generators. The benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most of them struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance on the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, through both prompting and fine-tuning.
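To make the two benchmark tasks concrete, below is a minimal Python sketch of what a valid versus a targeted test case generator might look like. The toy problem ("print the maximum of n integers"), its constraints, and the injected bug are illustrative assumptions for this summary, not examples taken from TCGBench.

    import random

    # Task 1 (valid generator): produce inputs that respect the
    # problem's constraints (here, an assumed 1 <= n <= 10^5 and
    # |a_i| <= 10^9), with no requirement beyond validity.
    def valid_generator(seed: int) -> str:
        rng = random.Random(seed)
        n = rng.randint(1, 10**5)
        a = [rng.randint(-10**9, 10**9) for _ in range(n)]
        return f"{n}\n{' '.join(map(str, a))}\n"

    # Task 2 (targeted generator): produce inputs that expose a
    # specific bug in a human-written solution. Suppose the buggy
    # solution initializes its running maximum to 0 instead of the
    # first element; it then answers wrongly whenever every value
    # is negative, so the generator emits all-negative arrays.
    def targeted_generator(seed: int) -> str:
        rng = random.Random(seed)
        n = rng.randint(1, 100)
        a = [rng.randint(-10**9, -1) for _ in range(n)]
        return f"{n}\n{' '.join(map(str, a))}\n"

    if __name__ == "__main__":
        print(valid_generator(42)[:60], "...")
        print(targeted_generator(42))

The contrast illustrates why the paper finds the second task harder: a targeted generator requires reasoning about the specific fault in a given program, not just about the problem's input constraints.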

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
37 pages

Category
Computer Science: Computation and Language