Towards Cost-Effective Reward Guided Text Generation

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In reward-guided text generation (RGTG), high inference overhead and suboptimal token selection—caused by reward models scoring only complete sequences—are critical bottlenecks. To address these, this paper proposes a single-step, full-vocabulary parallel reward modeling framework. The method introduces (i) a multi-candidate synchronous scoring mechanism grounded in the Bradley–Terry loss, with theoretical guarantees of improved selection consistency during autoregressive sequence expansion; and (ii) sequence-extension preference learning, enabling the reward model to assign real-time scores to arbitrary prefix–candidate token pairs. Crucially, this architecture reduces reward model invocations per generation step from O(L) to O(1), drastically accelerating inference. Empirical evaluation shows the approach achieves state-of-the-art inference speed across multiple benchmarks while matching the performance of offline RLHF baselines.
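The decoding loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rgtg_step`, the `beta` reward weight, and the additive logit/score combination are all assumptions introduced here; the key point it demonstrates is that one reward-model call yields scores for every candidate token at once.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def rgtg_step(lm_logits, reward_scores, beta=1.0):
    """One reward-guided decoding step (illustrative sketch).

    lm_logits:     per-token logits from the base language model.
    reward_scores: per-token scores from a SINGLE reward-model call
                   over the full vocabulary (the paper's O(1) setting,
                   vs. one call per candidate in prior RGTG methods).
    beta:          hypothetical weight on the reward signal.
    Returns the index of the greedily selected next token.
    """
    combined = [l + beta * r for l, r in zip(lm_logits, reward_scores)]
    probs = softmax(combined)
    return max(range(len(probs)), key=lambda i: probs[i])
```

Under this sketch, a strong reward score can override the base model's preference: with flat LM logits, the token with the highest reward is selected.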

📝 Abstract
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without the additional training required by standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a *single call* to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
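The Bradley-Terry training objective mentioned in the abstract is a standard pairwise preference loss; a minimal sketch of how it might be applied to scores for two expansions of a shared prefix is shown below. The function name and its use on single-token expansions are assumptions for illustration; the loss form itself, `-log sigma(r_w - r_l)`, is the standard Bradley-Terry formulation.

```python
import math

def bradley_terry_loss(score_preferred, score_dispreferred):
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_w - r_l)).

    Here r_w and r_l would be the reward model's scores for the
    preferred and dispreferred expansion of the same prefix.
    Minimizing this loss pushes the preferred score above the
    dispreferred one.
    """
    margin = score_preferred - score_dispreferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that the loss depends only on the score margin: equal scores give `log 2`, and the loss shrinks toward zero as the preferred score pulls ahead.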
Problem

Research questions and friction points this paper is trying to address.

Reduce test-time overhead in RGTG
Optimize reward model for partial sequences
Enhance inference efficiency in text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-call reward model
Bradley-Terry loss training
Efficient simultaneous token scoring
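The efficiency claim behind these contributions comes down to a simple call count, sketched below. The function and the parameter `k` (candidates scored per step in prior RGTG methods) are illustrative assumptions; the contrast it captures is the paper's stated reduction from many reward-model calls per step to one.

```python
def reward_calls_per_sequence(seq_len, k, single_call=False):
    """Count reward-model invocations to generate a sequence (sketch).

    Prior RGTG methods score each of k candidate continuations
    separately: k calls per step. The single-call model scores the
    whole vocabulary at once: 1 call per step.
    """
    calls_per_step = 1 if single_call else k
    return seq_len * calls_per_step
```

For a 100-token generation with 10 candidates per step, this gives 1000 calls under the per-candidate scheme versus 100 under the single-call scheme.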