Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

📅 2025-06-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the capability of large language models (LLMs) to generate *defect-directed test case generators* for competitive programming (CP). Addressing the lack of systematic benchmarks and rigorous task formalization in prior work, we introduce TCGBench—the first CP-specific benchmark for test generation—comprising two tasks: general test generator synthesis and defect-directed generator synthesis, supported by a high-quality, human-annotated instruction dataset. Methodologically, we integrate prompt engineering with supervised fine-tuning to establish a reproducible modeling and evaluation framework. Experiments show that state-of-the-art LLMs can produce syntactically valid test generators but significantly underperform humans in the critical task of exposing defects in human-written code. Both prompting with and fine-tuning on our new dataset yield substantial improvements. This study provides the first systematic characterization of LLM limitations in defect-directed test generation for CP, laying foundational groundwork for trustworthy test generation research in competitive programming settings.

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, through both prompting and fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to generate valid test case generators for competition-level programming problems
Evaluating LLMs' effectiveness in creating targeted test cases to expose human code bugs
Developing TCGBench benchmark to study LLM performance in test case generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes TCGBench for LLM test case generation
Generates targeted test cases to expose bugs
Uses curated dataset to enhance LLM performance
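To illustrate the two tasks the paper distinguishes, here is a minimal sketch (not from the paper; the problem, function names, and the 32-bit-overflow defect are illustrative assumptions) of a *general* test case generator versus a *targeted* one that is aimed at a specific bug in a submitted solution:

```python
import random

def general_generator(seed: int) -> str:
    """General task: emit any valid input for a hypothetical CP problem
    ("print the sum of n integers")."""
    rng = random.Random(seed)
    n = rng.randint(1, 10)
    nums = [rng.randint(-100, 100) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, nums))}\n"

def targeted_generator(seed: int) -> str:
    """Targeted task: stress a hypothesized defect (a 32-bit accumulator)
    by emitting values whose sum exceeds 2**31 - 1."""
    rng = random.Random(seed)
    n = 10
    nums = [rng.randint(10**9 - 5, 10**9) for _ in range(n)]
    return f"{n}\n{' '.join(map(str, nums))}\n"

def buggy_sum(inp: str) -> int:
    """Stand-in for the human-written defective solution: simulates a
    C-style signed 32-bit accumulator that wraps around on overflow."""
    total = 0
    for x in map(int, inp.split("\n")[1].split()):
        total = (total + x) % (2**32)
    return total - 2**32 if total >= 2**31 else total

def correct_sum(inp: str) -> int:
    """Reference solution with arbitrary-precision integers."""
    return sum(map(int, inp.split("\n")[1].split()))

# A targeted input exposes the defect; a small general input does not.
t = targeted_generator(0)
print(buggy_sum(t) != correct_sum(t))   # defect exposed
g = general_generator(0)
print(buggy_sum(g) == correct_sum(g))   # defect hidden
```

The benchmark's second task asks the LLM to produce generators like `targeted_generator` given the problem and the defective submission; the experimental finding is that models handle the general case far better than the targeted one.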
👥 Authors
Yuhan Cao — Shanghai Qi Zhi Institute
Zian Chen — ShanghaiTech University
Kun Quan — Wuhan University
Ziliang Zhang — University of California, Riverside (Real-time Embedded Systems, eXtended Reality)
Yu Wang — Institute of Information Engineering, Chinese Academy of Sciences
Xiaoning Dong — Tsinghua University
Yeqi Feng — Huazhong University of Science and Technology
Guanzhong He — Tsinghua University
Jingcheng Huang
Jianhao Li — ShanghaiTech University
Yixuan Tan
Jiafu Tang
Yilin Tang — Shanghai Qi Zhi Institute
Junlei Wu — ShanghaiTech University
Qianyu Xiao — Peking University
Can Zheng — University of Pittsburgh (Data Mining, Natural Language Processing, Medical AI)
Shouchen Zhou — ShanghaiTech University
Yuxiang Zhu — Arizona State University, Columbia University (3D Printing, Polymer Synthesis, Energy Storage and Conversion)
Yiming Huang — Independent Researcher
Tian Xie — Independent Researcher
Tianxing He — Tsinghua University (NLP)