🤖 AI Summary
Existing diffusion-based editing methods condition on the source image at every denoising step, which often leads to incomplete or unnatural edits when the target semantics diverge substantially from the input. To address this limitation, this work proposes DuET (Dual Expert Trajectories), a training-free inference-time strategy: during denoising, the process temporarily switches to a text-to-image phase, which is not conditioned on the source image, before returning to editing mode, thereby relaxing the source-image constraint. Without modifying model weights or increasing sampling cost, the method consistently improves instruction following, semantic fidelity, and perceptual quality across multiple diffusion models and benchmarks. In some cases these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.
📝 Abstract
Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.
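The mode switching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say where the text-to-image phase starts or ends, so the window fractions `t2i_start` and `t2i_end`, the function names, and the two step callables are all hypothetical placeholders. The sketch only shows the control flow, in which each denoising step calls exactly one of the two predictors, so the number of model calls equals the number of steps.

```python
from typing import Callable, List, TypeVar

Latent = TypeVar("Latent")


def dual_trajectory_schedule(num_steps: int, t2i_start: float, t2i_end: float) -> List[str]:
    """Label each denoising step as 'edit' or 't2i'.

    The window [t2i_start, t2i_end) is given as fractions of the total
    step count. These fractions are assumed for illustration; the source
    text does not specify them.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= t2i_start <= t2i_end <= 1.0:
        raise ValueError("need 0 <= t2i_start <= t2i_end <= 1")
    first = round(t2i_start * num_steps)
    last = round(t2i_end * num_steps)
    return ["t2i" if first <= i < last else "edit" for i in range(num_steps)]


def run_denoising(
    latent: Latent,
    schedule: List[str],
    edit_step: Callable[[Latent, int], Latent],
    t2i_step: Callable[[Latent, int], Latent],
) -> Latent:
    """Apply one predictor per step, chosen by the schedule.

    edit_step stands in for a source-image-conditioned denoising step and
    t2i_step for a text-only step; both are hypothetical callables that
    take the current latent and the step index.
    """
    for i, mode in enumerate(schedule):
        step = t2i_step if mode == "t2i" else edit_step
        latent = step(latent, i)
    return latent


# Toy usage: numbers stand in for latents so the control flow is visible.
schedule = dual_trajectory_schedule(num_steps=10, t2i_start=0.2, t2i_end=0.5)
result = run_denoising(0, schedule, lambda x, i: x + 1, lambda x, i: x + 10)
```

In a real pipeline the two callables would wrap the same model's denoising step with and without the source-image condition. Widening the hypothetical window would be one way to explore the trade-off between source preservation and edit fidelity that the abstract mentions.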