ZIP-RC: Zero-overhead Inference-time Prediction of Reward and Cost for Adaptive and Interpretable Generation

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) lack metacognitive capabilities: they cannot dynamically assess whether their reasoning will succeed or how much computation it will take. As a result, test-time scaling methods such as Best-of-N are inefficient, spend a fixed budget, and are unaware of confidence. To address this, the authors propose ZIP-RC, a zero-overhead, adaptive inference framework. Its core idea is to reuse the model's own reserved or unused logits to jointly predict the final reward and the remaining generation length, with no auxiliary model or additional forward passes. ZIP-RC uses this joint reward-cost distribution to compute a sampling utility and applies meta-actions to adapt the sampling policy during generation. On mathematical reasoning benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost. It also exposes tunable trade-offs among output quality, compute, and latency, improving inference efficiency and interpretability.

📝 Abstract

Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent metacognitive decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length, with no extra models, architectural changes, or inference overhead. This full joint distribution is used to compute a sampling utility, a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which token prefixes to continue or to initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
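
To make the logit-reuse idea in the abstract concrete, here is a minimal sketch of how one logit vector might be read as both a next-token distribution and a joint distribution over final reward and remaining length. The layout is an assumption made for illustration, not something the abstract specifies: the reserved logits are taken to sit directly after the first `vocab_size` entries, reward and length are taken to be discretized into bins, and each group gets its own softmax. The function names (`split_logits`, `expected_reward`, `expected_remaining_length`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def split_logits(logits, vocab_size, n_reward_bins, n_length_bins):
    """Read one logit vector as (next-token probs, joint reward/length table).

    Assumed layout (hypothetical): the first `vocab_size` logits are ordinary
    token logits; the next n_reward_bins * n_length_bins logits are reserved
    and encode the joint distribution in row-major order (reward bin = row).
    """
    n_reserved = n_reward_bins * n_length_bins
    if len(logits) < vocab_size + n_reserved:
        raise ValueError("not enough reserved logits for the requested bins")
    token_probs = softmax(logits[:vocab_size])
    flat = softmax(logits[vocab_size:vocab_size + n_reserved])
    joint = [flat[r * n_length_bins:(r + 1) * n_length_bins]
             for r in range(n_reward_bins)]
    return token_probs, joint


def expected_reward(joint, reward_values):
    """Expected final reward under the reward marginal of the joint table."""
    return sum(reward_values[r] * sum(row) for r, row in enumerate(joint))


def expected_remaining_length(joint, length_values):
    """Expected remaining length under the length marginal of the joint table."""
    return sum(length_values[l] * p for row in joint for l, p in enumerate(row))
```

Because both distributions come from the same vector, a decoder built this way would obtain the reward and length estimates from the forward pass it already runs for next-token prediction, which is the sense of "zero-overhead" used in the abstract.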

Problem

Research questions and friction points this paper is trying to address.

Predicts reward and cost during inference without overhead

Enables adaptive generation decisions using real-time introspection

Improves accuracy and efficiency over fixed-budget sampling methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-overhead reward-cost prediction using reserved logits

Joint distribution for reward and length without extra models

Adaptive sampling that maximizes a utility through meta-actions
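
The sampling utility described in the abstract (a linear combination of expected maximum reward, total compute, and latency for a set of samples) can be sketched as follows. This is an illustrative reading under stated assumptions, not the paper's implementation: samples are treated as independent, each sample's prediction is a joint table with reward bins as rows and length bins as columns, compute is the sum of expected remaining lengths, latency is the expected longest remaining length (samples run in parallel), and the weights `alpha` and `beta` and the minus signs are hypothetical choices.

```python
def marginals(joint):
    """Return (reward marginal, length marginal) of a reward-by-length table."""
    reward_marginal = [sum(row) for row in joint]
    length_marginal = [sum(col) for col in zip(*joint)]
    return reward_marginal, length_marginal


def expected_max(marginal_list, values):
    """E[max] of independent discrete variables on a shared ascending support.

    Uses P(max <= v_k) = product over variables of P(X_i <= v_k).
    """
    cum = [0.0] * len(marginal_list)
    prev_cdf = 0.0
    total = 0.0
    for k, v in enumerate(values):
        cdf_of_max = 1.0
        for i, m in enumerate(marginal_list):
            cum[i] += m[k]
            cdf_of_max *= cum[i]
        total += v * (cdf_of_max - prev_cdf)
        prev_cdf = cdf_of_max
    return total


def sampling_utility(joints, reward_values, length_values, alpha, beta):
    """Assumed form: E[max reward] - alpha * E[compute] - beta * E[latency]."""
    reward_ms, length_ms = zip(*(marginals(j) for j in joints))
    exp_max_reward = expected_max(reward_ms, reward_values)
    exp_compute = sum(
        sum(v * p for v, p in zip(length_values, m)) for m in length_ms
    )
    exp_latency = expected_max(length_ms, length_values)
    return exp_max_reward - alpha * exp_compute - beta * exp_latency
```

Under this sketch, a meta-action such as starting one more sample or dropping an unpromising prefix could be evaluated by comparing `sampling_utility` for the candidate sets of samples and keeping the set with the highest value.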

🔎 Similar Papers

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment