DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free text-guided audio editing methods typically rely on inversion, which is time-consuming and introduces reconstruction errors. This work proposes DirectAudioEdit, a training-free and inversion-free approach that constructs an editing trajectory from source to target during the diffusion denoising process, generating the edited audio directly under textual guidance. The authors describe it as the first attempt at audio editing that requires neither model retraining nor inversion. On music and event-level benchmarks across two backbones, it reduces macro-averaged Fréchet Audio Distance (FAD) and KL divergence by 15.9% and 15.8%, respectively, relative to DDPM inversion, while speeding up editing by up to 64.5%.
📝 Abstract
Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.
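As a reading aid for the reported percentages, the following is a minimal sketch of how a macro-averaged relative reduction could be computed. It assumes that "macro-averaged" means an unweighted mean over benchmarks taken before the relative reduction is computed; the paper may define the aggregation differently. The function names and all numeric values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


def macro_average(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean over benchmarks, ignoring benchmark size."""
    return mean(per_benchmark.values())


# Hypothetical FAD values (lower is better); placeholders only.
baseline_fad = {"music": 4.0, "events": 6.0}
proposed_fad = {"music": 3.0, "events": 6.0}

# Assumption: average across benchmarks first, then compare the two averages.
reduction = relative_reduction(
    macro_average(baseline_fad), macro_average(proposed_fad)
)
print(f"macro-averaged FAD reduction: {reduction:.1f}%")
```

Averaging the per-benchmark reductions instead of the per-benchmark metric values would generally give a different number, so the aggregation order matters when comparing against the reported figures.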
Problem

Research questions and friction points this paper is trying to address.

text-guided audio editing
inversion-free
diffusion models
audio editing
training-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

inversion-free
text-guided audio editing
diffusion prediction contrast
training-free
audio editing
Zhengkun Ge
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoqian Liu
Northeastern University, China
Speech
Haoran Zhang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuan Ge
Northeastern University, China
Reasoning, Multimodality LLMs
Junxiang Zhang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Zhengtao Yu
Kunming University of Science and Technology
Jingbo Zhu
Northeastern University, China
Machine Translation, Language Parsing, Natural Language Processing
Tong Xiao
Professor in Computer Science, Northeastern University, China
Natural Language Processing, Machine Translation, Language Modeling