🤖 AI Summary
This work addresses preference fine-tuning of deterministic one-step generative models in settings where policy likelihoods, denoising trajectories, or differentiable reward gradients are unavailable. The authors propose Drifting Preference Optimization (DrPO), an online preference optimization method for one-step models that requires no reward gradients. For each prompt, DrPO samples candidate images from the current generator, ranks them with a target reward, and synthesizes a feature-space update direction from the high- and low-scoring samples. Because the reward is used only for ranking, training needs no backpropagation through the reward model. Its core components are a non-parametric dipole preference field, a reference drift estimated from the frozen base generator, and ranking-based selection of high- and low-scoring samples. On SD-Turbo and SDXL-Turbo, evaluated with benchmarks including HPSv3 and GenEval, DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by 3.51× under the matched effective-batch setting. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
📝 Abstract
One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
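The abstract describes the update only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes a softmax-kernel mean-shift form for both the dipole field (attraction to high-ranked samples minus attraction to low-ranked ones) and the reference drift; the abstract specifies neither form. The function names and the parameters `k`, `eta`, `lam`, and `tau` are hypothetical. The sketch shows the two properties the abstract does state: the reward enters only through ranking, and the update is expressed as a regression target in feature space.

```python
import numpy as np

def kernel_mean_shift(x, anchors, tau=1.0):
    """For each row of x, return the softmax-kernel-weighted mean of anchors minus x.

    Assumed form: the abstract does not specify the kernel.
    """
    # Pairwise squared distances between rows of x and anchors, shape (N, M).
    d2 = ((x[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ anchors - x

def drpo_target(feats, rewards, ref_feats, k=2, eta=0.5, lam=0.1, tau=1.0):
    """Build a feature-space regression target for one prompt (hypothetical sketch).

    feats:     (N, D) features of candidates from the current generator
    rewards:   (N,)   scalar rewards, used only to rank the candidates
    ref_feats: (M, D) features of samples from the frozen base generator
    """
    order = np.argsort(rewards)   # ascending: only the ordering is used
    neg = feats[order[:k]]        # k lowest-scoring candidates
    pos = feats[order[-k:]]       # k highest-scoring candidates
    # Dipole field: pull toward high-ranked samples, push away from low-ranked ones.
    dipole = kernel_mean_shift(feats, pos, tau) - kernel_mean_shift(feats, neg, tau)
    # Reference drift: pull toward the frozen base generator's samples.
    drift = kernel_mean_shift(feats, ref_feats, tau)
    # In an autodiff framework this target would be wrapped in a stop-gradient.
    return feats + eta * (dipole + lam * drift)

def regression_loss(feats, target):
    """Mean squared distance between the current features and the fixed target."""
    return float(((feats - target) ** 2).sum(-1).mean())
```

Since only `np.argsort(rewards)` touches the reward, any strictly increasing transformation of the scores yields the same target. This is why a black-box or non-differentiable reward is sufficient in this kind of scheme.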