Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Rectified Flow (RF) models suffer from low image inversion fidelity and entangled multimodal attention. To address these challenges, this work proposes: (1) the first integration of high-order Runge–Kutta ODE solvers into the RF inversion process, substantially improving source image reconstruction fidelity; and (2) a Decoupled Diffusion Transformer Attention (DDTA) mechanism that explicitly disentangles text–image cross-attention from image self-attention, enabling fine-grained semantic editing control. By preserving RF’s computational efficiency, the method overcomes the limitations of conventional single-step or linear inversion and coupled attention architectures. Extensive experiments demonstrate state-of-the-art performance on both image reconstruction and text-guided editing tasks: PSNR improves by 2.1 dB, and editing localization error decreases by 37%.

Technology Category

Application Category

📝 Abstract
Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
Problem

Research questions and friction points this paper is trying to address.

Improve inversion accuracy for rectified flow models
Disentangle multimodal attention in diffusion transformers
Enable precise semantic control in image editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Runge-Kutta solver for high-order inversion
Decoupled Diffusion Transformer Attention mechanism
Disentangles text and image multimodal attention
🔎 Similar Papers
No similar papers found.