🤖 AI Summary
This work addresses the challenge of efficiently steering pretrained generative models toward high-fidelity, semantically aligned image editing and synthesis without fine-tuning or image inversion. To this end, it introduces a unified multi-reward optimization framework that jointly optimizes several objectives (semantic alignment, perceptual fidelity, spatial localization, object consistency, and human preference) at inference time via multi-reward Langevin dynamics. Key innovations include a differentiable reward grounded in visual question answering (VQA), which supplies fine-grained language-to-vision semantic supervision, and a prompt-aware adaptive weighting strategy that coordinates the heterogeneous objectives during sampling. The proposed method achieves state-of-the-art editing fidelity and compositional alignment across multiple image editing and compositional generation benchmarks.
📝 Abstract
We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.
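To make the sampling procedure described above concrete, the sketch below illustrates one way inference-time multi-reward Langevin guidance over a pretrained flow-matching sampler could look. This is a minimal illustration under assumptions, not the paper's released implementation: every name here (`velocity_model`, the reward callables, `adaptive_policy`, `rewardflow_sample`) is a hypothetical placeholder, and the base update is a plain Euler step.

```python
import torch

@torch.no_grad()
def base_update(x, t, dt, velocity_model):
    """Plain Euler step of the pretrained flow-matching model (placeholder)."""
    return x + dt * velocity_model(x, t)

def reward_gradient(x, t, rewards, weights):
    """Gradient of the weighted sum of differentiable rewards w.r.t. x.
    Each reward maps (x, t) -> scalar (e.g. a semantic-alignment score,
    a perceptual-fidelity score, or a differentiable VQA-based score);
    the weights come from the prompt-aware policy."""
    x = x.detach().requires_grad_(True)
    total = sum(w * r(x, t) for r, w in zip(rewards, weights))
    return torch.autograd.grad(total, x)[0]

def rewardflow_sample(x, velocity_model, rewards, adaptive_policy,
                      n_steps=50, noise_scale=0.0):
    """Inversion-free sampling loop: at every step, a prompt-aware policy
    sets the reward weights and a Langevin step size, and the sample is
    nudged along the combined reward gradient (plus optional noise)."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        weights, eta = adaptive_policy(i, n_steps)       # per-step modulation
        grad = reward_gradient(x, t, rewards, weights)   # multi-reward ascent direction
        x = base_update(x, t, dt, velocity_model)        # pretrained model update
        x = x + eta * grad + noise_scale * torch.randn_like(x)  # Langevin correction
    return x
```

In this reading, the adaptive policy is simply a function of the prompt and the current step that returns a weight per reward and a step size, so objectives such as localized grounding can be emphasized early and fidelity-preserving terms later; the exact scheduling used by RewardFlow is described in the paper, not here.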