AI Summary
This work proposes a training-free, inference-time alignment method for language models that avoids the high cost and instability of conventional reinforcement learning (RL) approaches. Leveraging energy-based guidance, the method samples directly from the optimal RL policy using the transition probability structure of masked language models. Key contributions include the first demonstration of training-free RL alignment, an online Monte Carlo estimator for the energy term, and the integration of importance sampling with modern inference acceleration frameworks to improve efficiency without compromising sample quality. Experimental results show significant improvements in generation quality across reasoning, programming, and scientific tasks, demonstrating the method's effectiveness and broad applicability.
Abstract
Reinforcement Learning (RL) post-training alignment for language models is effective but costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method that samples directly from the optimal RL policy. Under Masked Language Modeling (MLM), the transition probability decomposes into a reference policy model and an energy term. Building on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLMs (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
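To make the core idea concrete, here is a minimal illustrative sketch of energy-guided sampling via self-normalized importance sampling: candidates are drawn from a reference policy and reweighted by an energy term `exp(r(x)/beta)`, approximating a draw from the optimal RL policy `pi*(x) ∝ pi_ref(x) · exp(r(x)/beta)`. The function names, the toy reward, and the candidate count are illustrative assumptions, not the paper's actual implementation or estimator.

```python
import math
import random

def sample_optimal_policy(ref_sampler, reward, beta=1.0, n_candidates=8, rng=random):
    """Draw one sample approximately from pi*(x) ∝ pi_ref(x) * exp(reward(x)/beta)
    using self-normalized importance sampling over reference-policy candidates.
    (Illustrative sketch only; the paper's ETS estimator is more elaborate.)"""
    # Proposal = reference policy, so the importance ratio reduces to exp(r(x)/beta).
    candidates = [ref_sampler() for _ in range(n_candidates)]
    logits = [reward(x) / beta for x in candidates]
    m = max(logits)  # subtract the max for numerical stability before exponentiating
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Resample one candidate in proportion to its normalized importance weight.
    return rng.choices(candidates, weights=probs, k=1)[0]
```

As `beta → 0` the reweighting concentrates on the highest-reward candidate (best-of-N behavior); larger `beta` keeps the sample closer to the reference policy.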