Are we really tilting? The mechanics of reward guidance in flow and diffusion models

📅 2026-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work traces “reward hacking” in reward-guided diffusion models to the bias introduced by finite-particle plug-in estimation of the Doob h-function. In simplified settings with Gaussian or Gaussian mixture targets and quadratic rewards, the authors give a closed-form analysis that isolates two distinct failure modes of this estimator: over-optimization of the reward within each mode, and failure to select high-reward modes. Building on these insights, they propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and they clarify how best-of-n sampling compensates for the mode selection failure. Experiments on Gaussian mixtures, a 2D checkerboard density, and text-to-image generation with FLUX.1 support the theoretical predictions.
📝 Abstract
Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.
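The abstract's basic ingredients can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the paper's derivation: the target is taken to be p = N(mu, sigma2), the reward is the quadratic r(x) = -0.5 (x - a)^2, the tilt strength `lam` and all function names are hypothetical, and the h-function is reduced to the single expectation E_p[exp(lam r(X))] rather than being tracked along a diffusion path. The sketch only shows the general fact that the log of a finite-particle average underestimates the log of the true expectation on average (Jensen's inequality); it does not reproduce the two failure modes analysed in the paper.

```python
import numpy as np


def tilted_gaussian(mu, sigma2, a, lam):
    """Mean and variance of p(x) * exp(lam * r(x)) / Z for
    p = N(mu, sigma2) and r(x) = -0.5 * (x - a)**2 (product of Gaussians)."""
    precision = 1.0 / sigma2 + lam
    mean = (mu / sigma2 + lam * a) / precision
    return mean, 1.0 / precision


def log_h_exact(mu, sigma2, a, lam):
    """Closed form of log E_p[exp(lam * r(X))] for the same p and r."""
    scale = 1.0 + lam * sigma2
    return -0.5 * np.log(scale) - 0.5 * lam * (mu - a) ** 2 / scale


def log_h_plugin(particles, a, lam):
    """Plug-in estimate: log of the particle average of exp(lam * r(x_i)).
    `particles` has shape (n_trials, n_particles); returns shape (n_trials,)."""
    log_w = -0.5 * lam * (particles - a) ** 2
    shift = log_w.max(axis=1, keepdims=True)  # stabilise the exponentials
    return shift[:, 0] + np.log(np.exp(log_w - shift).mean(axis=1))


def average_plugin(n_particles, n_trials=20000, mu=0.0, sigma2=1.0,
                   a=2.0, lam=1.0, seed=0):
    """Average the plug-in estimate over many independent particle sets."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, np.sqrt(sigma2), size=(n_trials, n_particles))
    return float(log_h_plugin(x, a, lam).mean())


if __name__ == "__main__":
    exact = log_h_exact(0.0, 1.0, 2.0, 1.0)
    for n in (1, 4, 64):
        # Gap between the averaged plug-in estimate and the closed form.
        print(n, average_plugin(n) - exact)
```

With these illustrative settings the printed gap should be negative and shrink toward zero as the number of particles grows, which is the sense in which a finite-particle plug-in estimate is biased even when the target and reward are this simple.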
Problem

Research questions and friction points this paper is trying to address.

reward hacking
reward guidance
generative models
Doob h-function
distribution fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward guidance
Doob h-function
reward hacking
diffusion models
plug-in estimation