Training-free image inversion for one-step diffusion models

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two factors that limit real-image inversion and editing with one-step diffusion models: weak editability of the initial latent and a gap between text captions and image content. The authors propose TFinv, a training-free inversion framework for one-step diffusion models. TFinv combines iterative noise alignment (iterNA), which pulls the inverted noise toward a Gaussian distribution, with suffix learning (suffL), which adds learned suffix prompt tokens to improve caption-image alignment. Together they map input images into the initial noise space, and a mask-based technique supports localized edits while preserving the background. On PIE-Bench, TFinv reports state-of-the-art performance among one-step diffusion editing methods and significantly outperforms existing multistep approaches in efficiency.
📝 Abstract
In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models, addressing key challenges in real image inversion and editing. We first identify two critical factors hampering real-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between the initial noise and the ideal Gaussian distribution, and (2) Caption Gap, which refers to the misalignment between text captions and image representations. Both factors influence inversion efficiency and the editability of one-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), which minimizes the distribution gap to align with the standard Gaussian distribution, and suffix learning (suffL), which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniques enable precise inversion of input images into their initial noise representations and facilitate image editing. Furthermore, we propose a mask-based editing technique for localized edits while preserving background integrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not only achieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existing multistep approaches in efficiency. The code is available at https://github.com/tttao-uwu/TFinv.git.
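The abstract ties initial latent editability to the distance between the inverted noise and a Gaussian distribution. As a rough illustration of what such a distance can look like, the sketch below scores a flattened latent by the closed-form KL divergence between a Gaussian fitted to its values and the standard Gaussian N(0, 1), then shows that simple moment matching drives that score to zero. This is an assumption-laden toy, not the paper's iterNA procedure: the function names `gaussian_gap` and `standardize`, the choice of KL divergence as the distance, and the use of moment matching are all hypothetical stand-ins.

```python
import math
import random


def gaussian_gap(latent):
    """KL( N(mean, var) || N(0, 1) ) for a Gaussian fitted to the latent values.

    Hypothetical proxy for "distance to the ideal Gaussian"; the paper may
    use a different measure.
    """
    n = len(latent)
    mean = sum(latent) / n
    var = sum((x - mean) ** 2 for x in latent) / n
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))


def standardize(latent):
    """Shift and scale the latent to zero mean and unit variance.

    Plain moment matching, used here only as a stand-in for an
    alignment step.
    """
    n = len(latent)
    mean = sum(latent) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in latent) / n)
    return [(x - mean) / std for x in latent]


# A toy "inverted latent" that is off-distribution: mean 0.5, std 2.0.
random.seed(0)
latent = [random.gauss(0.5, 2.0) for _ in range(10_000)]

gap_before = gaussian_gap(latent)
gap_after = gaussian_gap(standardize(latent))
print(f"gap before: {gap_before:.4f}, gap after: {gap_after:.2e}")
```

Matching the first two moments only checks mean and variance; a real alignment step would also need to keep the latent consistent with the input image, which this sketch ignores.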
Problem

Research questions and friction points this paper is trying to address.

image inversion
one-step diffusion models
initial latent editability
caption gap
real image editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free inversion
one-step diffusion models
iterative noise alignment
suffix learning
mask-based editing
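
Mask-based editing, as described in the abstract, restricts changes to a chosen region while preserving the background. A common way to express that idea is a per-pixel blend of the edited and source images. The sketch below is a minimal illustration under that assumption; the function name `blend_with_mask` and the nested-list image layout are hypothetical, and the paper's actual technique may apply the mask elsewhere (for example in latent or noise space).

```python
def blend_with_mask(source, edited, mask):
    """Per-pixel blend: mask 1.0 takes the edited value, mask 0.0 keeps the source."""
    return [
        [m * e + (1.0 - m) * s for s, e, m in zip(src_row, edit_row, mask_row)]
        for src_row, edit_row, mask_row in zip(source, edited, mask)
    ]


# 2x2 single-channel toy images; the mask selects the diagonal for editing.
source = [[0.1, 0.2], [0.3, 0.4]]
edited = [[0.9, 0.8], [0.7, 0.6]]
mask = [[1.0, 0.0], [0.0, 1.0]]

result = blend_with_mask(source, edited, mask)
print(result)  # [[0.9, 0.2], [0.3, 0.6]]
```

Pixels where the mask is 0.0 are copied from the source unchanged, which is the background-preservation property the abstract refers to.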