🤖 AI Summary
This work addresses text-guided image editing, where a model must follow the target instruction while preserving editing-irrelevant structure such as background and layout. The authors propose Consistent-Inversion, a training-free framework based on reverse consistency guidance: during target denoising, intermediate latents are checked for whether they can be reversed toward the source inversion trajectory under the source prompt. The source inversion trajectory thus acts as a dynamic constraint rather than only a fixed initialization. The resulting reverse consistency discrepancy is applied as a correction signal at selected early denoising steps, which reduces the coupling between structure preservation and content modification. The method combines diffusion inversion, an auxiliary target-side noise representation, source-guided reverse denoising, and sparse consistency correction. On PIE-Bench under a unified SD3.5 protocol, it improves background and structural fidelity while maintaining target-prompt alignment, and it remains compatible with classical Stable Diffusion inversion pipelines.
📝 Abstract
Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background and layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable Diffusion inversion pipelines.
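The control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the update rule, the construction of the auxiliary noise representation, or the step schedule, so all of these are assumptions here. The hypothetical `step` function is a linear stand-in for a prompt-conditioned diffusion model, prompts are plain vectors, and the names `invert_source`, `edit`, `guided_steps`, and `strength` are illustrative only.

```python
import numpy as np


def step(z, cond, h):
    """Toy Euler step standing in for a prompt-conditioned model (assumption).

    h > 0 moves the latent toward `cond` (denoising);
    h < 0 moves it away from `cond` (inversion, i.e. adding "noise").
    """
    return z + h * (cond - z)


def invert_source(x_src, c_src, n_steps, h):
    """Source inversion trajectory: traj[0] is the clean latent, traj[n_steps] the noisiest."""
    traj = [np.asarray(x_src, dtype=float)]
    for _ in range(n_steps):
        traj.append(step(traj[-1], c_src, -h))
    return traj


def edit(x_src, c_src, c_tgt, n_steps=10, h=0.1, guided_steps=3, strength=0.5):
    """Target denoising with a sparse, early reverse consistency correction.

    Assumed interpretation of the abstract; the real method may differ in every detail.
    """
    traj = invert_source(x_src, c_src, n_steps, h)
    z = traj[n_steps]  # start from the terminal inverted latent
    for k in range(n_steps, 0, -1):
        z_next = step(z, c_tgt, h)  # ordinary target-prompt denoising step
        if n_steps - k < guided_steps:  # correct only the first few (early) steps
            # Auxiliary target-side noise representation (assumed: re-noise under the target prompt).
            z_aux = step(z_next, c_tgt, -h)
            # Source-guided reverse denoising back to noise level k - 1.
            z_rev = step(z_aux, c_src, h)
            # Reverse consistency discrepancy against the source inversion trajectory.
            discrepancy = traj[k - 1] - z_rev
            z_next = z_next + strength * discrepancy
        z = z_next
    return z
```

With `strength=0.0` or `guided_steps=0` the loop reduces to plain inversion followed by target-prompt denoising, so the correction stays an optional add-on to an existing inversion-based editor. Restricting it to a few early steps is what keeps the extra cost small: each corrected step adds two model evaluations in this sketch.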