Gradient-Guided Reward Optimization for Inference-time Alignment

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing inference-time alignment methods for large language models under distribution drift: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models leaves them vulnerable to reward hacking. The authors propose a lightweight inference-time method that monitors token-level entropy to flag high-uncertainty regions and, on detection, injects nudging tokens generated from the gradient signals of an off-the-shelf reward model, steering the generation trajectory with minimal intervention. Relative to conventional re-ranking paradigms, the method is reported to improve alignment across safety, helpfulness, and reasoning benchmarks, to increase coverage of high-quality responses and robustness to reward hacking, and to add minimal computational overhead.
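
For orientation, the re-ranking paradigm that the summary contrasts against can be sketched in a few lines of Python. This is a generic Best-of-N illustration, not code from the paper: `sample` and `reward` are hypothetical callables standing in for a base model and a reward model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],         # hypothetical: draws one response from the base model
    reward: Callable[[str, str], float],  # hypothetical: scores a (prompt, response) pair
    n: int = 4,
) -> str:
    """Draw n candidate responses and return the one with the highest reward."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    # The result is always one of the sampled candidates, so quality is capped
    # by what the base model can produce, and any flaw in `reward` is exploited
    # directly by the argmax.
    return max(candidates, key=lambda response: reward(prompt, response))


def make_toy_sampler(responses: List[str]) -> Callable[[str], str]:
    """Toy stand-in for a model: returns the given responses one per call."""
    iterator = iter(responses)
    return lambda prompt: next(iterator)


# Toy usage: the "reward" here is just response length, purely for illustration.
chosen = best_of_n(
    "question",
    sample=make_toy_sampler(["a", "bbb", "cc"]),
    reward=lambda prompt, response: float(len(response)),
    n=3,
)
```

The sketch makes the two stated limitations concrete: selection can only pick among existing samples, and it trusts the reward score completely.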

📝 Abstract

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
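
The entropy-monitoring step described in the abstract can be illustrated with a minimal sketch. It assumes that "token-level entropy" means the Shannon entropy of the next-token distribution and that detection is a simple threshold test; the function names and the threshold value are hypothetical, and the gradient-based generation of nudging tokens is not shown because the abstract does not specify it.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def high_uncertainty_steps(
    step_logits: Sequence[Sequence[float]],
    threshold: float,  # hypothetical trigger level, not a value from the paper
) -> List[int]:
    """Return the decoding steps whose entropy exceeds the threshold."""
    return [
        step
        for step, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]


# Toy usage: step 0 is sharply peaked, step 1 is uniform over four tokens.
flagged = high_uncertainty_steps([[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], threshold=1.0)
```

In this toy input only the uniform step is flagged; in a GGRO-style decoder, such a flagged step would be where an intervention is considered, while confident steps are left untouched.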

Problem

Research questions and friction points this paper is trying to address.

inference-time alignment

distribution drift

reward hacking

Large Language Models

reward-guided search

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-Guided Reward Optimization

inference-time alignment

reward hacking robustness