🤖 AI Summary
In text-guided image editing, existing score distillation methods struggle to preserve both prompt fidelity and background consistency, particularly in object insertion tasks, because of severe spatial and magnitude fluctuations in their gradients; the result is high hyperparameter sensitivity and frequent editing failures. This paper proposes a fine-tuning-free localized score distillation framework with two novel mechanisms: (1) attention-driven spatial regularization, which leverages self-attention maps to confine edits to semantically relevant regions, and (2) gradient filtering with ℓ²-normalization, which suppresses outlier gradients and stabilizes gradient magnitude. Together these components stabilize the optimization process. Evaluations across multiple benchmarks demonstrate substantial improvements in prompt alignment and editing success rate, and a user study finds the method preferred over state-of-the-art approaches 58–64% of the time, with significantly better background preservation.
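The two mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `stabilized_grad`, the outlier threshold `mean + k·std`, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def stabilized_grad(grad, attn_map, k=3.0, eps=1e-8):
    """Hypothetical sketch of the two gradient stabilizers.

    grad:     raw score-distillation gradient, shape (H, W, C)
    attn_map: self-attention relevance map in [0, 1], shape (H, W),
              high where the edit should apply.
    """
    # (1) Attention-driven spatial regularization: confine the update
    #     to semantically relevant regions by masking with the map.
    g = grad * attn_map[..., None]

    # (2a) Gradient filtering: clip outlier entries whose magnitude
    #      exceeds mean + k*std of the masked gradient (assumed rule).
    mag = np.abs(g)
    thresh = mag.mean() + k * mag.std()
    g = np.clip(g, -thresh, thresh)

    # (2b) l2-normalization: enforce a stable overall update magnitude.
    return g / (np.linalg.norm(g) + eps)
```

Masked-out pixels receive no update, clipped entries can no longer dominate the step, and the final ℓ² norm of the update is fixed, which is one way to realize the magnitude stability the summary describes.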
📝 Abstract
While diffusion models show promising results in image editing given a target prompt, achieving both prompt fidelity and background preservation remains difficult. Recent works have introduced score distillation techniques that leverage the rich generative prior of text-to-image diffusion models to solve this task without additional fine-tuning. However, these methods often struggle with tasks such as object insertion. Our investigation of these failures reveals significant variations in gradient magnitude and spatial distribution, making hyperparameter tuning highly input-specific or unsuccessful. To address this, we propose two simple yet effective modifications: attention-based spatial regularization and gradient filtering with normalization, both aimed at reducing these variations during gradient updates. Experimental results show our method outperforms state-of-the-art score distillation techniques in prompt fidelity, increasing the rate of successful edits while preserving the background. In a user study, participants preferred our method over state-of-the-art techniques across all three metrics, and by 58–64% overall.