🤖 AI Summary
Existing image customization methods encode both subject identity and artistic style into a single shared embedding, and the resulting entanglement compromises customization fidelity. To address this limitation, this work proposes Equilibrated Diffusion, a frequency-driven diffusion approach that decomposes concepts in the frequency domain: low-frequency components model subject content, while high-frequency components capture stylistic attributes, and each is optimized through its own embedding. The approach further integrates mask-guided diffusion to restrict irrelevant background changes and improve alignment with textual prompts, and a Residual Reference Attention (RRA) mechanism to preserve the subject's structure and identity. Experiments show that the method outperforms mainstream baselines in subject fidelity and text adherence, and the separate optimization helps it generalize to unseen stylistic prompts.
📝 Abstract
Image customization learns a target subject from reference concept images and generates new images of it conditioned on text prompts, typically changing the style or background. Prevailing methods fine-tune the model to pack all concept attributes into a single latent embedding, but the entangled attributes make it hard to remove irrelevant interference from the reference style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles concept features for balanced customization and consistent text-visual matching. Unlike conventional methods, which learn the full concept with a shared embedding and unified tuning, our work exploits the inherent link between image frequency components and semantics: low frequencies represent subject content, and high frequencies correspond to style. We decompose concepts in frequency space and optimize a separate embedding for each frequency band. This separate optimization enables the denoiser to capture style detached from subject identity and to generalize better to unseen stylistic prompts. Merging the multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and improve text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments show that Equilibrated Diffusion outperforms mainstream baselines on subject fidelity and text adherence.
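The frequency decomposition described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single-channel image, an ideal circular low-pass filter in the 2-D FFT domain, and a hypothetical `cutoff` parameter. The function name `split_frequencies` is likewise invented for illustration, and the actual method may use a different filter or operate on latents rather than pixels.

```python
import numpy as np


def split_frequencies(image, cutoff=0.1):
    """Split a 2-D image into low- and high-frequency components.

    `cutoff` is a hypothetical radius in normalized frequency units
    (cycles per pixel, where the Nyquist limit is 0.5). It is an
    assumption made for this sketch, not a value from the paper.
    """
    h, w = image.shape
    # Normalized frequency coordinate of every FFT bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)

    # Ideal low-pass mask; its complement is the high-pass mask.
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real    # coarse content
    high = np.fft.ifft2(spectrum * ~low_mask).real  # fine detail
    return low, high


# Stand-in for a reference concept image (random data, for illustration only).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
low, high = split_frequencies(image)
```

Because the two masks are complementary, `low + high` reconstructs the input, so the split loses no information. Under the abstract's premise, the low-frequency component would supervise the subject embedding and the high-frequency component the style embedding.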