🤖 AI Summary
This work addresses post-training large language models for open-ended instruction following, where reliable reward signals (ground-truth verifiers or human preference labels) are often costly or unavailable. The authors propose Cross-Model Entropy (CME), a label-free reward defined as the mean log-likelihood of a generator's response under a separate verifier model. CME is continuous and requires no additional training, and because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. CME is integrated into GRPO with no other changes to the training loop. Trained on UltraFeedback prompts and evaluated on AlpacaEval 2.0, CME-trained models beat their untrained base models in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates of 52.5% to 71.4%.
📝 Abstract
Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
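The reward described in the abstract can be illustrated with a minimal sketch. It assumes the verifier exposes per-token log-probabilities of the response given the prompt; the `verifier_logprobs` callable, the `toy_verifier` stand-in, and the function names are hypothetical and are not taken from the paper's code. The advantage step uses the commonly cited GRPO group normalization (reward minus group mean, divided by group standard deviation), which may differ in detail from the authors' setup.

```python
import math
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical interface: given a prompt and the response tokens, return
# log p_V(token_t | prompt, tokens_<t) for every response token.
TokenLogProbFn = Callable[[str, Sequence[str]], List[float]]


def cme_reward(prompt: str,
               response_tokens: Sequence[str],
               verifier_logprobs: TokenLogProbFn) -> float:
    """Mean log-likelihood of the response under the verifier."""
    if not response_tokens:
        raise ValueError("response must contain at least one token")
    logps = verifier_logprobs(prompt, response_tokens)
    if len(logps) != len(response_tokens):
        raise ValueError("expected one log-probability per response token")
    return sum(logps) / len(logps)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (assumed GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_verifier(prompt: str, tokens: Sequence[str]) -> List[float]:
    """Stand-in for a real verifier LM: a fixed unigram table, context ignored."""
    probs = {"paris": 0.5, "is": 0.25, "the": 0.125}
    return [math.log(probs.get(t, 0.0625)) for t in tokens]


# One group of sampled responses for a single prompt.
group = [["paris"], ["paris", "is"], ["the", "moon"]]
rewards = [cme_reward("capital of France?", r, toy_verifier) for r in group]
advantages = group_relative_advantages(rewards)
```

In a real pipeline the toy table would be replaced by a forward pass of a separate verifier model over the prompt and response, with log-probabilities gathered at the response positions. Averaging over tokens rather than summing keeps the reward from depending directly on response length.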