🤖 AI Summary
This work proposes a reinforcement learning paradigm that operates without ground-truth labels, addressing the limitations of traditional approaches that rely on costly annotations or task-specific verifiers—particularly in settings where correctness is ambiguous or labeling is impractical. The method leverages a meta-evaluator to generate reward signals by answering natural-language meta-questions (e.g., “Is the answer correct?”) and employs group-relative policy optimization to update the generator. This framework enables multi-objective trade-offs, guides the model toward reliable reasoning paths, and generalizes to open-domain, unlabeled scenarios. Experimental results demonstrate that the approach achieves accuracy and sample efficiency comparable to label-dependent methods across multiple tasks, enabling effective training under fully unsupervised conditions.
📝 Abstract
Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc rationalization, and generalizes to open-domain settings where ground-truth labels are unavailable, broadening the domains in which LLMs may be trained with RL.
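The reward and advantage computation described in the abstract can be sketched in a few lines. This is a hypothetical minimal illustration, not the paper's implementation: the function names (`meta_reward`, `group_relative_advantages`) and the toy log-probabilities are assumptions. It shows the two ingredients named above: treating the evaluator's probability of answering "Yes" to a meta-question as the scalar reward, and computing group-relative advantages by standardizing rewards within a group of sampled responses (no learned value function), as in group-relative policy optimization.

```python
import math

def meta_reward(eval_logprob_yes: float) -> float:
    """Reward = evaluator's probability of a positive judgment,
    i.e. P("Yes" | meta-question), recovered from its log-probability."""
    return math.exp(eval_logprob_yes)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each reward against the
    mean and standard deviation of its own sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:  # identical rewards carry no ranking signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Toy example: evaluator log-probs of "Yes" for four sampled answers
# to one prompt (hypothetical values).
logp_yes = [-0.1, -0.7, -2.3, -0.05]
rewards = [meta_reward(lp) for lp in logp_yes]
advantages = group_relative_advantages(rewards)
```

Responses the evaluator judges more likely correct receive positive advantages within their group, so the policy update pushes the generator toward them without ever consulting a ground-truth label.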