🤖 AI Summary
This work addresses the problem of disentangling the factors shared by two data distributions from the factors salient to only one of them, in the setting of high-fidelity image generation. The authors propose a conditioning framework for diffusion models that operates without textual prompts and uses weak supervision to decompose the image conditioning into additive common and salient factors. Contrastive decomposition is integrated into diffusion models for the first time, and the authors prove that this additive decomposition is identifiable under mild conditions, which enables targeted manipulation of only the salient factor, such as swapping or interpolating it. The approach combines high-quality image synthesis with fine-grained editing, making factor disentanglement more practical for comparative analysis of high-resolution images.
📝 Abstract
Contrastive Analysis aims to separate factors that are common to two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.
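To make the swap and interpolation operations concrete, the following is a minimal sketch of how an additive factorization of the conditioning could be manipulated. It is not the authors' implementation: the function names (`compose`, `swap_salient`, `interpolate_salient`), the use of plain NumPy vectors, and the linear interpolation schedule are all assumptions made for illustration, and the encoders and the diffusion denoiser that would produce and consume these vectors are omitted.

```python
import numpy as np


def compose(common: np.ndarray, salient: np.ndarray) -> np.ndarray:
    """Additive factorization: conditioning = common factor + salient factor."""
    return common + salient


def swap_salient(common_a: np.ndarray, salient_b: np.ndarray) -> np.ndarray:
    """Keep image A's common factor and attach image B's salient factor."""
    return compose(common_a, salient_b)


def interpolate_salient(
    common: np.ndarray,
    salient_start: np.ndarray,
    salient_end: np.ndarray,
    alpha: float,
) -> np.ndarray:
    """Linearly blend two salient factors while holding the common factor fixed.

    alpha = 0 gives salient_start, alpha = 1 gives salient_end.
    (Linear blending is an assumption of this sketch.)
    """
    salient = (1.0 - alpha) * salient_start + alpha * salient_end
    return compose(common, salient)


# Hypothetical factor vectors standing in for learned encoder outputs.
common_a = np.array([1.0, 2.0])
salient_a = np.array([0.5, -0.5])
salient_b = np.array([0.0, 1.0])

swapped = swap_salient(common_a, salient_b)
halfway = interpolate_salient(common_a, salient_a, salient_b, alpha=0.5)
```

In a full pipeline, the resulting vector would be passed as the conditioning input of the image-conditioned diffusion model, so that only the salient content of the generated image changes while the common content is preserved.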