Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses preference alignment in image inpainting by proposing a framework based on Direct Preference Optimization (DPO), which effectively leverages publicly available reward models to construct preference data and reveals their inherent biases in brightness, composition, and color. To mitigate reward hacking induced by these biases, the authors introduce a simple yet effective ensemble strategy combining multiple reward models, augmented with candidate expansion and calibration mechanisms to enhance alignment robustness. Experimental results demonstrate that, without modifying the model architecture or introducing new training data, the proposed method significantly outperforms state-of-the-art approaches across standard metrics, evaluations using vision-language foundation models, and human assessments. Furthermore, the study validates the transferability of preference alignment to object removal tasks.
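For readers unfamiliar with Direct Preference Optimization, the standard pairwise DPO objective can be sketched as below. This is the generic likelihood-based form, not the paper's exact training loss: diffusion-based inpainting models usually rely on a variant that replaces log-likelihoods with denoising errors, and the summary does not say which form the authors use. The function name `dpo_loss`, its parameter names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic pairwise DPO loss for one (preferred, rejected) pair.

    logp_*     : log-probabilities under the model being trained
    ref_logp_* : log-probabilities under the frozen reference model
    beta       : strength of the implicit KL regularisation (assumed value)
    """
    # Implicit reward margin: how much more the trained model favours the
    # preferred sample over the rejected one, relative to the reference.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the trained model and the reference agree, the margin is zero and the loss equals log 2; the loss falls as the trained model shifts probability toward the preferred sample.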

📝 Abstract

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases, particularly in brightness, composition, and color scheme, that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.
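One way to picture how a reward-model ensemble might turn a set of inpainting candidates into a preference pair is sketched below. It is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not specify how scores are combined or calibrated, so the per-model z-score normalisation, the plain averaging, and the names `zscore`, `ensemble_scores`, and `build_preference_pair` are all hypothetical.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Normalise one reward model's scores across the candidates.

    Assumed calibration step: puts models with different score ranges
    on a common scale so that no single model dominates the average.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # The model cannot tell the candidates apart; it contributes nothing.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def ensemble_scores(scores_by_model):
    """Average the normalised scores of all reward models per candidate."""
    normalised = [zscore(row) for row in scores_by_model.values()]
    num_candidates = len(normalised[0])
    return [mean(row[i] for row in normalised) for i in range(num_candidates)]


def build_preference_pair(candidates, scores_by_model):
    """Return (preferred, rejected): the best and worst ensemble-scored candidates."""
    combined = ensemble_scores(scores_by_model)
    best = max(range(len(candidates)), key=combined.__getitem__)
    worst = min(range(len(candidates)), key=combined.__getitem__)
    return candidates[best], candidates[worst]
```

For example, with three candidates and two hypothetical reward models on very different scales:

```python
pair = build_preference_pair(
    ["img_a", "img_b", "img_c"],
    {"rm_1": [1.0, 2.0, 3.0], "rm_2": [300.0, 100.0, 200.0]},
)
```

Here `rm_1` ranks `img_c` first while `rm_2` ranks `img_a` first; after normalisation neither model's raw scale decides the outcome, which is the intuition behind using an ensemble to dampen any single model's bias.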

Problem

Research questions and friction points this paper is trying to address.

preference alignment

image inpainting

reward models

reward hacking

bias mitigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

preference alignment

reward ensemble

image inpainting