🤖 AI Summary
This work addresses the challenge of enabling unsupervised self-improvement of large code models in realistic settings where high-quality teacher models and test oracles are unavailable. The authors propose ConSelf, a novel approach that introduces code semantic entropy to measure problem-level uncertainty and construct a learnability-driven curriculum. Additionally, they design a behavioral consensus weighting mechanism to guide direct preference optimization (Con-DPO), effectively mitigating noise in self-generated labels. Requiring no external supervision, ConSelf consistently outperforms existing baselines across multiple benchmarks and diverse backbone architectures, demonstrating its effectiveness in continuously enhancing code generation capabilities under fully unsupervised conditions.
📝 Abstract
Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable unit tests. In real-world scenarios, however, reference solutions and test oracles are much harder to obtain than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: Can a code language model improve itself without access to a superior teacher or a test oracle? To answer this, we propose ConSelf, a self-improving approach built on two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling curriculum construction around the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments across various benchmarks and backbone LLMs show that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization for improving code generation without external supervision.
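To make the two core quantities concrete, here is a minimal sketch of how code semantic entropy and behavioral consensus could be computed. This is an illustrative reconstruction from the abstract's description, not the paper's implementation: it assumes candidate programs are callables, groups them into behavioral equivalence classes by their outputs on shared test inputs (no oracle needed), and then takes the entropy over those classes; the consensus weight for a preference pair is sketched as the fraction of samples agreeing with the chosen program's behavior.

```python
import math
from collections import Counter

def behavior_signature(program, test_inputs):
    """Run a candidate program on shared test inputs (no expected outputs
    required). Programs producing identical output sequences are treated
    as behaviorally equivalent. `program` as a callable is an assumption
    made for this sketch."""
    outputs = []
    for x in test_inputs:
        try:
            outputs.append(repr(program(x)))
        except Exception as e:  # crashes also count as a behavior
            outputs.append(f"error:{type(e).__name__}")
    return tuple(outputs)

def code_semantic_entropy(programs, test_inputs):
    """Entropy over behavioral equivalence classes of sampled programs.
    Zero means all samples agree (low uncertainty); higher values mean
    more functional diversity (higher problem-level uncertainty)."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    n = len(sigs)
    return -sum((c / n) * math.log(c / n) for c in Counter(sigs).values())

def consensus_weight(chosen_program, programs, test_inputs):
    """Sketch of a behavioral-consensus weight for a preference pair:
    the fraction of sampled programs that agree with the chosen program's
    behavior. Pairs whose 'chosen' side is an outlier get down-weighted,
    mitigating noise in self-generated labels."""
    sigs = [behavior_signature(p, test_inputs) for p in programs]
    chosen_sig = behavior_signature(chosen_program, test_inputs)
    return sigs.count(chosen_sig) / len(sigs)
```

For example, if four sampled programs split evenly into two behavior classes, the semantic entropy is ln 2, and a chosen program from either class receives a consensus weight of 0.5.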