🤖 AI Summary
The scarcity of high-quality test cases in programming competitions severely hampers the accuracy of model evaluation. Method: This paper introduces the first LLM-based agent framework for test-case generation tailored to competitive programming, integrating multi-stage prompt engineering, feedback-driven optimization grounded in 1.72 million real submission logs, and a reinforcement learning-based evaluation and validation mechanism to automatically generate boundary-aware, robust test cases. Contribution/Results: We release CodeContests+, an enhanced benchmark dataset that significantly improves evaluation accuracy and true positive rate. Experiments demonstrate that CodeContests+ substantially improves model-assessment fidelity over the original CodeContests dataset and enables superior policy optimization in LLM-based reinforcement learning training.
📝 Abstract
Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Test case generation is therefore a necessary step in building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of the test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, in particular a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL training.
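The evaluation protocol described above (a submission is judged accepted only if it passes every test case, and that verdict is compared against the submission's ground-truth pass/fail label to compute accuracy and TPR) can be sketched as follows. This is a minimal illustration; the function names and data layout are assumptions, not the paper's actual implementation:

```python
def judge(per_test_results):
    """A submission is accepted only if it passes every test case."""
    return all(per_test_results)

def evaluate_test_suite(submissions):
    """Compare test-suite verdicts against ground-truth labels.

    `submissions` is a list of (ground_truth_accepted, per_test_results)
    pairs -- a hypothetical layout chosen for illustration.
    Returns (accuracy, true_positive_rate).
    """
    tp = fp = tn = fn = 0
    for truth, results in submissions:
        verdict = judge(results)
        if truth and verdict:
            tp += 1          # correct solution, correctly accepted
        elif truth and not verdict:
            fn += 1          # correct solution rejected by a faulty test
        elif not truth and verdict:
            fp += 1          # wrong solution slips past weak tests
        else:
            tn += 1          # wrong solution correctly rejected
    accuracy = (tp + tn) / len(submissions)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # True Positive Rate
    return accuracy, tpr

# Toy example: 3 truly-correct and 2 truly-wrong submissions
subs = [
    (True,  [True, True, True]),   # TP
    (True,  [True, False, True]),  # FN: a bad test case rejects it
    (True,  [True, True, True]),   # TP
    (False, [True, True, True]),   # FP: weak tests fail to catch the bug
    (False, [False, True, True]),  # TN
]
acc, tpr = evaluate_test_suite(subs)
```

A low TPR thus signals test suites that wrongly reject correct solutions, which is exactly the failure mode the improved test cases aim to reduce.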