🤖 AI Summary
This work addresses preference alignment in image inpainting. We propose a lightweight, model-agnostic alignment framework that requires no architectural modifications or additional data collection: high-quality preference training data are synthesized from multiple publicly available reward models; systematic biases across brightness, composition, and color dimensions are rigorously characterized; and a simple yet effective multi-reward-model ensemble strategy is introduced to mitigate such biases. Alignment is achieved via direct preference optimization (DPO) applied to off-the-shelf generative models. Our approach consistently outperforms prior methods across standard quantitative metrics, GPT-4V-based evaluation, and human studies—demonstrating substantial improvements in both subjective quality and output consistency of inpainted results. The method establishes a new baseline for preference alignment that is concise, robust, and fully reproducible.
📝 Abstract
This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even though some of them are not reliable evaluators. (2) Preference data exhibits robust scaling trends, in both the number of candidates and the number of samples, across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, make them susceptible to reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work sets a simple yet solid baseline that pushes this promising frontier forward. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
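The core recipe of the abstract, scoring inpainted candidates with several reward models and taking the ensemble winner/loser as a DPO (chosen, rejected) pair, can be sketched as below. This is a minimal illustration, not the paper's code: the reward functions and candidate representation are placeholder assumptions standing in for the public reward models the paper ensembles.

```python
# Illustrative sketch of multi-reward-model ensembling for building
# DPO preference pairs. Reward functions here are placeholders; in the
# paper they would be public reward models scoring inpainted images.

def ensemble_score(candidate, reward_fns):
    """Average the scores from all reward models. Averaging across
    models dampens any single model's bias (e.g., a preference for
    brighter images), which is the bias-mitigation idea in the paper."""
    return sum(fn(candidate) for fn in reward_fns) / len(reward_fns)

def build_preference_pair(candidates, reward_fns):
    """Rank candidates by ensemble score and return the best and worst
    as the (chosen, rejected) pair for DPO training."""
    ranked = sorted(candidates, key=lambda c: ensemble_score(c, reward_fns))
    return ranked[-1], ranked[0]  # (chosen, rejected)

# Toy usage: two placeholder "reward models" over dict-encoded candidates.
reward_fns = [
    lambda c: c["quality"],        # stand-in for an aesthetic reward model
    lambda c: -c["brightness"],    # stand-in for a model penalizing over-bright outputs
]
candidates = [
    {"quality": 1.0, "brightness": 0.0},
    {"quality": 0.0, "brightness": 1.0},
    {"quality": 0.5, "brightness": 0.5},
]
chosen, rejected = build_preference_pair(candidates, reward_fns)
```

With the resulting pairs, DPO fine-tuning can then be applied to an off-the-shelf inpainting model without architectural changes, as the abstract describes.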