🤖 AI Summary
This work addresses background leakage and unintended modifications in existing training-free image editing methods, which rely on global latent transport. The authors propose SAM-Flow, a source-anchored masked flow framework that localizes editable regions using a scout image and token-grounded attention maps, applies differential velocity updates only within those regions, and anchors the remaining areas to the source-image latent trajectory. A time-varying source-anchored projection mechanism, combining dynamic soft masks, transition regions, and temporal mask accumulation, improves spatial stability and boundary naturalness. The method is plug-and-play, requires no fine-tuning, and integrates with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX. Qualitative and quantitative experiments show accurate local edits with significantly improved background preservation, offering a simple, general paradigm for localized training-free image editing.
📝 Abstract
Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: https://github.com/chwbob/Sam-Flow.
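The masked update described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`soft_mask`, `accumulate_mask`, `masked_anchored_step`), the sigmoid form of the soft mask, the exponential-moving-average form of temporal accumulation, and the use of a fixed source latent as the anchor are all assumptions chosen for illustration. Random arrays stand in for the velocity predictions of a flow-matching model.

```python
import numpy as np

def soft_mask(attn, threshold=0.5, sharpness=10.0):
    """Map an attention map to a soft mask in [0, 1] (sigmoid form is an assumption)."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # normalize to [0, 1]
    return 1.0 / (1.0 + np.exp(-sharpness * (a - threshold)))

def accumulate_mask(prev_mask, new_mask, momentum=0.8):
    """Temporal mask accumulation as an exponential moving average (assumed form)."""
    return momentum * prev_mask + (1.0 - momentum) * new_mask

def masked_anchored_step(z, z_src_next, v_tgt, v_src, mask, dt):
    """One Euler step: differential velocity inside the mask, source anchor outside."""
    z_edit = z + dt * mask * (v_tgt - v_src)           # update only where mask > 0
    return mask * z_edit + (1.0 - mask) * z_src_next   # blend the rest back to the source

# Toy run with hypothetical shapes: latent is (C, H, W), mask is (H, W).
rng = np.random.default_rng(0)
z_src = rng.normal(size=(4, 8, 8))   # stand-in for the source-image latent
attn = np.zeros((8, 8))
attn[2:6, 2:6] = 1.0                 # stand-in for a token-grounded attention map
z = z_src.copy()
mask = np.zeros((8, 8))
steps = 4
for _ in range(steps):
    v_src = rng.normal(size=z.shape)  # placeholder for the source-prompt velocity
    v_tgt = rng.normal(size=z.shape)  # placeholder for the target-prompt velocity
    mask = accumulate_mask(mask, soft_mask(attn))
    # Simplification: the anchor is the fixed source latent, not a full trajectory.
    z = masked_anchored_step(z, z_src, v_tgt, v_src, mask, dt=1.0 / steps)
```

In this sketch, positions where the mask is close to zero stay close to the source latent, while positions inside the attended region accumulate the velocity difference. The soft mask gives a gradual transition between the two rather than a hard boundary.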