🤖 AI Summary
Standard Metropolis–Hastings (MH) sampling is expensive and mixes slowly when drawing samples from the sequence-level power distribution $p^\alpha$. The cause is a mismatch: MH proposes resampling positions uniformly, whereas the high-entropy decision points where $p^\alpha$ departs from $p$ are sparse and spatially clustered. This work proposes Entropy-Guided Power Sampling (EGPS), which, for the first time, incorporates token-level entropy into the MCMC proposal mechanism. By directing local resampling toward uncertain segments and skipping deterministic ones, EGPS concentrates compute where it is most needed. The method requires no training and no external verifier, and its sampling cost scales with entropy mass rather than sequence length. EGPS applies Multiple-Try Metropolis at decision points and, on Qwen2.5-Math-7B, reaches best or tied-best accuracy on MATH500 (75.8%), HumanEval (62.2%), and GPQA (42.4%), with up to a 12.6× wall-clock speedup over the MH baseline.
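For orientation, the quantities named in this summary can be written in their standard textbook form. This is a sketch in assumed notation, not taken from the paper: $\pi_\alpha$ for the target, $q$ for a generic proposal, $A$ for the acceptance probability, $H_t$ for the token-level entropy at position $t$, and $\mathcal{V}$ for the vocabulary. EGPS's actual proposal and acceptance rule may differ in detail.

$$
\pi_\alpha(x) \;\propto\; p(x)^\alpha, \qquad p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
$$

$$
A(x \to x') = \min\!\left(1,\; \frac{p(x')^\alpha \, q(x \mid x')}{p(x)^\alpha \, q(x' \mid x)}\right)
$$

$$
H_t = -\sum_{v \in \mathcal{V}} p(v \mid x_{<t}) \,\log p(v \mid x_{<t})
$$

In the generic MH rule, the normalizing constant of $\pi_\alpha$ cancels in the ratio, so only the unnormalized sequence likelihoods $p(x)^\alpha$ and the proposal densities are needed.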
📝 Abstract
Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis–Hastings (MH) sampler, a Markov chain Monte Carlo (MCMC) method, is both expensive and slow-mixing. We trace both problems to a structural mismatch: $p^\alpha$ mainly departs from $p$ at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix, wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from the token-level entropy already available in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points, so that sampling cost scales with entropy mass rather than sequence length. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 $75.8\%$, HumanEval $62.2\%$, GPQA $42.4\%$) at up to a $12.6\times$ wall-clock speedup over the MH baseline.
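The contrast between uniform and entropy-guided position selection can be illustrated with a small sketch. It is not the EGPS algorithm: the function names, the threshold value, and the rule "weight proportional to entropy above a cutoff" are all assumptions made for illustration, and the toy distributions stand in for a model's next-token probabilities. A real sampler would also have to account for this non-uniform proposal in its acceptance ratio to keep $p^\alpha$ as the stationary distribution.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def position_weights(entropies, threshold=0.1):
    """Hypothetical proposal weights over sequence positions.

    Positions whose entropy is below `threshold` get weight zero
    (treated as deterministic and skipped); the rest are weighted in
    proportion to their entropy. Falls back to uniform weights when
    no position clears the threshold.
    """
    kept = [h if h >= threshold else 0.0 for h in entropies]
    total = sum(kept)
    if total == 0.0:
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in kept]


def sample_position(weights, rng):
    """Draw one resampling position according to the given weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Toy next-token distributions, one per position (assumed values).
dists = [
    [1.0, 0.0, 0.0, 0.0],          # fully deterministic
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain over 4 tokens
    [0.997, 0.001, 0.001, 0.001],  # near-deterministic
    [0.5, 0.5, 0.0, 0.0],          # two-way decision point
]

entropies = [token_entropy(d) for d in dists]
weights = position_weights(entropies)

rng = random.Random(0)
pos = sample_position(weights, rng)
```

Under these assumptions, positions 0 and 2 receive zero weight, so a move never spends a proposal on them, whereas a uniform scheme would pick them half of the time.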