🤖 AI Summary
The scarcity of high-quality test cases in programming competitions severely hampers the accuracy of model evaluation. Method: This paper introduces the first LLM-based agent framework for test-case generation tailored to competitive programming, integrating multi-stage prompt engineering, feedback-driven optimization grounded in 1.72 million real submission logs, and a reinforcement learning-based evaluation and validation mechanism to automatically generate boundary-aware, robust test cases. Contribution/Results: We release CodeContests+, an enhanced benchmark dataset that significantly improves evaluation accuracy and true positive rate. Experiments demonstrate that CodeContests+ substantially improves model-assessment fidelity over the original CodeContests dataset and enables superior policy optimization in LLM-based reinforcement learning training.
📝 Abstract
Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Test case generation is therefore a necessary step in building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of the test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, in particular a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL training.
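The evaluation protocol described above (a submission is judged accepted only if it passes every test case, and that verdict is compared against the submission's ground-truth pass/fail label to compute accuracy and TPR) can be sketched as follows. This is a minimal illustration; the function names and data layout are assumptions, not the paper's actual implementation:

```python
def judge(per_test_results):
    """A submission is accepted only if it passes every test case."""
    return all(per_test_results)

def evaluate_test_suite(submissions):
    """Compare test-suite verdicts against ground-truth labels.

    `submissions` is a list of (ground_truth_accepted, per_test_results)
    pairs -- a hypothetical layout chosen for illustration.
    Returns (accuracy, true_positive_rate).
    """
    tp = fp = tn = fn = 0
    for truth, results in submissions:
        verdict = judge(results)
        if truth and verdict:
            tp += 1          # correct solution, correctly accepted
        elif truth and not verdict:
            fn += 1          # correct solution rejected by a faulty test
        elif not truth and verdict:
            fp += 1          # wrong solution slips past weak tests
        else:
            tn += 1          # wrong solution correctly rejected
    accuracy = (tp + tn) / len(submissions)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # True Positive Rate
    return accuracy, tpr

# Toy example: 3 truly-correct and 2 truly-wrong submissions
subs = [
    (True,  [True, True, True]),   # TP
    (True,  [True, False, True]),  # FN: a bad test case rejects it
    (True,  [True, True, True]),   # TP
    (False, [True, True, True]),   # FP: weak tests fail to catch the bug
    (False, [False, True, True]),  # TN
]
acc, tpr = evaluate_test_suite(subs)
```

A low TPR thus signals test suites that wrongly reject correct solutions, which is exactly the failure mode the improved test cases aim to reduce.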