Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing inference-time reward alignment methods: they typically initialize particles from a standard prior, so they rarely reach high-reward regions in complex, multimodal reward landscapes and often become trapped in local modes. The authors propose PATHS, which uses parallel tempering to couple multiple sampling chains across a ladder of reward temperatures. By combining Sequential Monte Carlo, Metropolis exchange moves, and adaptive temperature scheduling, PATHS explores rare, high-reward regions more efficiently at inference time. The method mitigates mode trapping and improves alignment with user-specified rewards on layout-to-image generation and quantity-aware image synthesis, with gains most pronounced on complex prompts.
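For orientation, the textbook form of a reward-tempered ladder and its Metropolis exchange rule can be written as below. This is the generic parallel tempering formulation, not necessarily the exact one used in PATHS; the symbols $p$ (prior), $r$ (reward), and $T_k$ (temperature of chain $k$) are notation assumed here for illustration.

```latex
% Assumed notation: p(x) is the prior, r(x) the reward, T_1 < T_2 < ... < T_K the ladder.
\pi_k(x) \;\propto\; p(x)\,\exp\!\left(\frac{r(x)}{T_k}\right), \qquad k = 1, \dots, K

% Proposed exchange of states x_i and x_j between chains i and j;
% the prior terms cancel, so only the rewards appear in the ratio.
\alpha_{\text{swap}}
  = \min\!\left\{ 1,\;
      \frac{\pi_i(x_j)\,\pi_j(x_i)}{\pi_i(x_i)\,\pi_j(x_j)} \right\}
  = \min\!\left\{ 1,\;
      \exp\!\left[ \left( \frac{1}{T_i} - \frac{1}{T_j} \right)
                   \bigl( r(x_j) - r(x_i) \bigr) \right] \right\}
```

Under this rule, a swap that moves the higher-reward state into the colder chain (smaller $T$) is always accepted, which is how states discovered on the flattened, high-temperature landscapes can reach the chain that targets the sharpest reward.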

📝 Abstract

Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this task by iteratively filtering and propagating multiple particles. However, we show that standard SMC-based methods often suffer from poor performance because they initialize particles from a standard prior, whereas high-reward regions in complex reward landscapes are extremely rare. Further, we show that even recent reward-aware initial sampling approaches remain vulnerable to getting trapped in local modes, as complex reward landscapes are often multi-modal. To overcome these limitations, we propose PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method that couples multiple sampling chains through parallel tempering. PATHS maintains a ladder of reward-tempered chains and periodically performs Metropolis swaps, enabling efficient exploration across flattened reward landscapes, thereby mitigating the mode-trapping issues. Our analysis reveals that this mechanism substantially enhances the finite-budget exploration of rare, high-reward regions that are typically challenging to sample. Experiments on layout-to-image and quantity-aware generation show that PATHS achieves consistent gains in alignment quality, particularly on complex prompts.
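The ladder-and-swap mechanism described above can be illustrated on a toy problem. The following is a minimal sketch of generic parallel tempering on a one-dimensional target, assuming a standard normal prior and a made-up bimodal reward; it is not the PATHS implementation and involves no diffusion model or SMC stage. All function names, the reward shape, the temperature ladder, and the step sizes are hypothetical choices for illustration.

```python
import math
import random


def reward(x):
    # Hypothetical bimodal reward: a broad bump near x = -2 and a
    # taller, narrower bump near x = 3 (far from the prior mean).
    return 4.0 * math.exp(-(x + 2.0) ** 2) + 10.0 * math.exp(-4.0 * (x - 3.0) ** 2)


def log_target(x, temp):
    # Reward-tempered target: standard normal prior times exp(reward / temp).
    return -0.5 * x * x + reward(x) / temp


def swap_log_accept(temp_i, temp_j, r_i, r_j):
    # Log Metropolis ratio for exchanging the states of chains i and j.
    # The prior terms cancel, leaving only the rewards.
    return (1.0 / temp_i - 1.0 / temp_j) * (r_j - r_i)


def accept(log_ratio, rng):
    # Accept with probability min(1, exp(log_ratio)).
    return rng.random() < math.exp(min(0.0, log_ratio))


def parallel_tempering(temps, n_steps=2000, step_size=0.5, swap_every=5, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in temps]  # one state per chain, drawn from the prior
    cold_samples = []
    for step in range(n_steps):
        # Local random-walk Metropolis move within each chain.
        for k, temp in enumerate(temps):
            proposal = xs[k] + rng.gauss(0.0, step_size)
            if accept(log_target(proposal, temp) - log_target(xs[k], temp), rng):
                xs[k] = proposal
        # Periodic exchange moves between adjacent temperatures.
        if step % swap_every == 0:
            for k in range(len(temps) - 1):
                log_ratio = swap_log_accept(
                    temps[k], temps[k + 1], reward(xs[k]), reward(xs[k + 1])
                )
                if accept(log_ratio, rng):
                    xs[k], xs[k + 1] = xs[k + 1], xs[k]
        cold_samples.append(xs[0])  # chain 0 is the coldest (temps[0])
    return cold_samples


if __name__ == "__main__":
    samples = parallel_tempering(temps=[1.0, 2.0, 4.0, 8.0])
    near_far_mode = sum(1 for x in samples if abs(x - 3.0) < 1.0)
    print(f"cold-chain samples near x = 3: {near_far_mode} of {len(samples)}")
```

In this sketch the hot chains see a flattened reward and can cross between modes more easily, while the exchange moves pass promising states down to the cold chain. Running the same loop with a single temperature (`temps=[1.0]`) gives a baseline for comparing how often the cold chain reaches the far mode within the same budget.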

Problem

Research questions and friction points this paper is trying to address.

inference-time reward alignment

complex reward landscapes

mode trapping

rare high-reward regions

initial sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Tempering

Inference-time Reward Alignment

Sequential Monte Carlo