AI Summary
This work proposes a training-free, inference-time alignment method for language models that avoids the high cost and instability of conventional reinforcement learning (RL) approaches. Leveraging energy-based guidance, the method samples directly from the optimal RL policy using the transition probability structure of masked language models. Key contributions include the first demonstration of training-free RL alignment, an online Monte Carlo estimator for the energy term, and the integration of importance sampling with modern inference acceleration frameworks to improve efficiency without compromising sample quality. Experimental results show significant improvements in generation quality across reasoning, programming, and scientific tasks, demonstrating the method's effectiveness and broad applicability.
Abstract
Reinforcement Learning (RL) post-training alignment for language models is effective but costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method that samples directly from the optimal RL policy. Under Masked Language Modeling (MLM), the transition probability decomposes into a reference policy model and an energy term. Building on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLMs (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
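To make the core idea concrete, here is a minimal illustrative sketch of energy-guided sampling via self-normalized importance sampling: candidates are drawn from a reference policy and reweighted by an energy term `exp(r(x)/beta)`, approximating a draw from the optimal RL policy `pi*(x) ∝ pi_ref(x) · exp(r(x)/beta)`. The function names, the toy reward, and the candidate count are illustrative assumptions, not the paper's actual implementation or estimator.

```python
import math
import random

def sample_optimal_policy(ref_sampler, reward, beta=1.0, n_candidates=8, rng=random):
    """Draw one sample approximately from pi*(x) ∝ pi_ref(x) * exp(reward(x)/beta)
    using self-normalized importance sampling over reference-policy candidates.
    (Illustrative sketch only; the paper's ETS estimator is more elaborate.)"""
    # Proposal = reference policy, so the importance ratio reduces to exp(r(x)/beta).
    candidates = [ref_sampler() for _ in range(n_candidates)]
    logits = [reward(x) / beta for x in candidates]
    m = max(logits)  # subtract the max for numerical stability before exponentiating
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Resample one candidate in proportion to its normalized importance weight.
    return rng.choices(candidates, weights=probs, k=1)[0]
```

As `beta → 0` the reweighting concentrates on the highest-reward candidate (best-of-N behavior); larger `beta` keeps the sample closer to the reference policy.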