On the Generalization Gap in Self-Evolving Language Model Reasoning

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work systematically investigates whether self-evolving language models, given only unlabeled prompts and a base model in a strictly closed-loop setting, can approximate the performance of models trained with ground-truth supervision. Focusing on the Knights and Knaves logical reasoning task, we evaluate four strategies (single-round verification, multi-round feedback revision, iterative training, and curriculum learning) within a unified offline framework. We present the first quantitative analysis of the generalization gap between self-evolution and ground-truth supervision under rigorous closed-loop conditions, demonstrating that multi-round critique-and-revision mechanisms are particularly effective for large models such as Gemma 12B. Our results show that self-evolution consistently improves reasoning over the base model, albeit with diminishing returns and a remaining gap to ground-truth supervision. Notably, Gemma 12B with multiple rounds of revision approaches the performance of ground-truth-supervised training, while gains on real-world benchmarks are modest.

📝 Abstract

Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but plateaus as more training compute is invested and ultimately leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.
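
To illustrate why Knights and Knaves puzzles provide deterministic solutions, the following is a minimal sketch of a brute-force solver that could serve as an oracle label source. It is not the paper's implementation: the function name `solve_kk`, the statement encoding, and the two-person puzzle are all hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: a world maps each name to True (knight) or False (knave),
# and a statement is a function that evaluates to True or False in a given world.
World = Dict[str, bool]
Statement = Callable[[World], bool]


def solve_kk(names: List[str], statements: List[Tuple[str, Statement]]) -> List[World]:
    """Enumerate every role assignment and keep the consistent ones."""
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        world = dict(zip(names, roles))
        # Knights always tell the truth and knaves always lie, so each
        # statement's truth value must equal its speaker's role.
        if all(stmt(world) == world[speaker] for speaker, stmt in statements):
            solutions.append(world)
    return solutions


# Illustrative puzzle (an assumption, not taken from the paper):
#   A says "B is a knave."
#   B says "A and I are both knights."
puzzle = [
    ("A", lambda w: not w["B"]),
    ("B", lambda w: w["A"] and w["B"]),
]

solutions = solve_kk(["A", "B"], puzzle)
print(solutions)  # [{'A': True, 'B': False}]
```

Because the search is exhaustive, a well-posed puzzle yields exactly one consistent assignment, which is what makes oracle labels cheap to obtain for this task. In a closed-loop setting as described above, such labels would be withheld from the self-evolution algorithm and used only for evaluation or for the oracle-supervised baseline.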

Problem

Research questions and friction points this paper is trying to address.

self-evolution

generalization gap

language model reasoning

oracle supervision

closed-loop learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution

closed-loop learning

internally generated supervision