Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of jointly improving large language models' (LLMs) code generation and unit test generation capabilities in the absence of ground-truth code annotations. The authors propose CURE, a co-evolutionary reinforcement learning framework with dual coder and tester roles: the tester learns directly from the coder's erroneous outputs, and both roles are jointly optimized from their interaction outcomes, without any ground-truth code as supervision. CURE fine-tunes Qwen2.5-Instruct base models with a dedicated reward design, and the resulting models extend naturally to test-time scaling and agentic coding. Experiments show that ReasonFlux-Coder-7B/14B achieve a 5.3% absolute gain in code generation accuracy and a 9.0% improvement in Best-of-N accuracy. The long-chain-of-thought variant, ReasonFlux-Coder-4B, attains 64.8% inference efficiency in unit test generation and also serves as an effective reward model for reinforcement learning on base models.

📝 Abstract
We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
Problem

Research questions and friction points this paper is trying to address.

Co-evolving coding and unit test generation via reinforcement learning
Improving code generation accuracy without ground-truth supervision
Enhancing test-time scaling and agentic coding performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-evolving coder and tester via reinforcement learning
No ground-truth code needed for supervision
Improves code generation accuracy (+5.3%) and Best-of-N accuracy (+9.0%)
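
The interaction between coder and tester described above powers downstream uses such as Best-of-N selection: candidate programs sampled from the coder are scored by the unit tests sampled from the tester, and the candidate passing the most tests is kept. The sketch below illustrates this scoring loop; all names and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Best-of-N selection with generated unit tests,
# as used in test-time scaling. Candidates and tests are toy stand-ins
# for samples from the coder and tester models.

def passes(code_fn, test_case):
    """Run one (inputs, expected_output) test against a candidate solution."""
    inputs, expected = test_case
    try:
        return code_fn(*inputs) == expected
    except Exception:
        # A crashing candidate simply fails the test.
        return False

def best_of_n(candidates, unit_tests):
    """Pick the candidate that passes the most generated unit tests.

    candidates: list of callables sampled from the coder.
    unit_tests: list of (inputs, expected) pairs sampled from the tester.
    """
    scores = [sum(passes(c, t) for t in unit_tests) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# Toy example: three candidate implementations of absolute value.
cands = [lambda x: x, lambda x: abs(x), lambda x: -x]
tests = [((3,), 3), ((-2,), 2), ((0,), 0)]
best = best_of_n(cands, tests)  # selects the correct abs implementation
```

In the paper's setting, the quality of `unit_tests` is exactly what the tester's RL reward optimizes: tests that discriminate correct from incorrect coder outputs make this selection step more reliable.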