INFUSER: Influence-Guided Self-Evolution Improves Reasoning

📅 2026-06-08

🤖 AI Summary

Existing self-evolution methods either depend on extensively curated or teacher-generated training data, or reward the generator with difficulty heuristics that need not improve the solver, which limits how much they strengthen reasoning. This work proposes INFUSER, a framework in which a generator and a solver co-evolve: the generator constructs question-answer pairs from unstructured documents to train the solver, and is rewarded by an optimizer-aware influence score that quantifies each question's contribution to the solver's performance gain on the target distribution. To train the generator on this continuous, noisy reward, the work introduces DuGRPO, a dual-normalized variant of GRPO; together these turn the document pool into an adaptive curriculum guided by solver progress. Applied to Qwen3-8B-Base, INFUSER achieves over 20% relative improvement on Olympiad and SuperGPQA benchmarks. Notably, its 8B co-evolved generator outperforms a frozen 32B thinking generator on math and coding.

📝 Abstract

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.
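The idea of rewarding a question by its influence on the solver can be illustrated with a toy sketch. It is not the paper's implementation. It assumes a linear model with squared loss, a plain SGD step (the paper's score is optimizer-aware, which this simplification ignores), and the standard GRPO group normalization. The abstract does not specify DuGRPO's second normalization, so it is not attempted here. All function names are hypothetical.

```python
import numpy as np


def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    return (w @ x - y) * x


def first_order_influence(w, train_example, target_examples, lr):
    """First-order estimate of the drop in mean target loss after one
    plain SGD step on train_example (assumption: SGD, not Adam)."""
    x, y = train_example
    g_train = grad_squared_loss(w, x, y)
    g_target = np.mean(
        [grad_squared_loss(w, xt, yt) for xt, yt in target_examples], axis=0
    )
    # Taylor expansion: L(w - lr * g_train) ~ L(w) - lr * <g_train, g_target>
    return lr * float(g_train @ g_target)


def group_normalized_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: center and scale rewards within a group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Toy "solver" weights and a one-example "target distribution".
w = np.zeros(2)
target = [(np.array([1.0, 0.0]), 1.0)]

# Three candidate "questions": helpful, irrelevant, and harmful to the target.
candidates = [
    (np.array([1.0, 0.0]), 1.0),
    (np.array([0.0, 1.0]), 1.0),
    (np.array([1.0, 0.0]), -1.0),
]

influences = [first_order_influence(w, c, target, lr=0.1) for c in candidates]
advantages = group_normalized_advantages(influences)
print(influences)
print(advantages)
```

In this toy setting the candidate aligned with the target gets a positive score, the orthogonal one gets zero, and the conflicting one gets a negative score. The contrast with a difficulty heuristic is that a hard question can still score zero or negative here if training on it does not move the solver toward the target.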

Problem

Research questions and friction points this paper is trying to address.

self-evolution

reasoning

influence-guided

language model

curriculum learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

influence-guided self-evolution

co-training framework

optimizer-aware influence score