🤖 AI Summary
Existing training-free methods for text-guided audio editing typically rely on inversion, which adds computational overhead and introduces reconstruction errors. This work proposes DirectAudioEdit, a training-free and inversion-free approach that constructs a source-to-target editing path through the diffusion denoising dynamics, so the edited audio is generated directly under textual guidance. The authors present it as the first attempt at audio editing that requires neither model retraining nor inversion. On music and event-level benchmarks across two backbones, the approach reduces macro-averaged Fréchet Audio Distance (FAD) and KL divergence by 15.9% and 15.8%, respectively, compared with DDPM inversion, and speeds up editing by up to 64.5%.
📝 Abstract
Text-guided audio editing aims to modify the acoustic content specified by a text prompt while preserving the source components that are irrelevant to the edit. Existing training-free methods typically rely on inversion-based editing. Inversion-free editing is appealing because it reduces computational overhead and reconstruction errors, but it remains largely unexplored for audio. The key challenge is to construct a source-to-target editing path through the diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8%, respectively, compared with DDPM inversion, while achieving up to a 64.5% editing speedup.
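
To make the reported percentages easier to interpret, the following is a minimal sketch of one common way to compute a macro-averaged relative reduction for a lower-is-better metric such as FAD or KL. It assumes the macro average is the equal-weight mean of per-benchmark reductions against the baseline; the abstract does not state the exact averaging procedure, so this reading is an assumption. The function names, benchmark keys, and scores are hypothetical placeholders and are not results from the paper.

```python
from typing import Dict, Tuple


def relative_reduction(baseline: float, method: float) -> float:
    """Percentage reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - method) / baseline


def macro_average_reduction(scores: Dict[str, Tuple[float, float]]) -> float:
    """Equal-weight mean of per-benchmark reductions.

    `scores` maps a benchmark name to (baseline_score, method_score).
    Assumption: each benchmark contributes equally, regardless of its size.
    """
    if not scores:
        raise ValueError("at least one benchmark is required")
    reductions = [relative_reduction(b, m) for b, m in scores.values()]
    return sum(reductions) / len(reductions)


# Hypothetical placeholder scores for illustration only (NOT from the paper).
hypothetical_fad = {
    "music_benchmark": (2.0, 1.5),  # (baseline FAD, method FAD)
    "event_benchmark": (4.0, 3.0),
}

print(macro_average_reduction(hypothetical_fad))
```

Under this reading, every benchmark carries the same weight, so a large benchmark cannot dominate the headline number. A different convention, such as averaging the raw scores first and then taking a single reduction, would generally give a different percentage.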