🤖 AI Summary
Free-form, text-guided audio editing remains hindered by the slow reverse diffusion process. This paper introduces the Virtual Consistency Framework, a model-agnostic approach that enables efficient, text-driven audio editing without model fine-tuning, architectural modification, or explicit inversion. Instead of inverting the source audio, the method injects text-conditioned embeddings directly into the diffusion sampling iterations and enforces virtual consistency constraints that preserve the original audio's structure, balancing fidelity against edit controllability. The approach is compatible with any pre-trained diffusion-based audio model. Experiments demonstrate a several-fold speedup over existing neural editing methods, along with superior subjective (MOS: 4.21) and objective (STOI: 0.93, ESTOI: 0.87) results that significantly outperform baseline approaches; a user study with 16 participants further corroborates these findings.
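The summary describes the mechanism only at a high level. As a rough illustration of how an inversion-free, text-conditioned sampling loop with a consistency pull toward the source audio might look, here is a minimal Python sketch; the names (`denoise_fn`, `edit_embedding`, `consistency_weight`) and the noise schedule are assumptions for illustration, not the paper's actual algorithm or API.

```python
# Illustrative sketch only: an inversion-free, text-conditioned sampling loop
# with a "consistency" pull toward the source audio latent. All names and the
# noise schedule are assumptions, not the paper's implementation.
import numpy as np


def edit_audio(source_latent, edit_embedding, denoise_fn,
               num_steps=50, consistency_weight=0.5, seed=0):
    """Edit `source_latent` toward `edit_embedding` without explicit inversion.

    denoise_fn(x_t, t, cond) is assumed to return an estimate of the clean
    latent x0 from the noisy latent x_t at time t, conditioned on `cond`.
    """
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(source_latent.shape)  # start from pure noise

    # Simple variance-preserving-style schedule: t=1 is all noise, t=0 is clean.
    timesteps = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        # 1) Text-conditioned denoising: the edit prompt steers the sample.
        x0_pred = denoise_fn(x_t, t, cond=edit_embedding)

        # 2) Consistency-style constraint (illustrative): blend the prediction
        #    with the source latent to preserve the original audio's structure.
        x0_pred = ((1.0 - consistency_weight) * x0_pred
                   + consistency_weight * source_latent)

        # 3) Re-noise to the next timestep; at t_next = 0 this is just x0_pred.
        noise = rng.standard_normal(source_latent.shape)
        x_t = np.sqrt(1.0 - t_next) * x0_pred + np.sqrt(t_next) * noise

    return x_t


if __name__ == "__main__":
    # Toy run with a dummy denoiser standing in for a pretrained audio model.
    source = np.zeros(128)
    prompt = np.ones(16)
    dummy_denoiser = lambda x_t, t, cond: x_t * (1.0 - t)
    edited = edit_audio(source, prompt, dummy_denoiser, num_steps=10)
    print(edited.shape)  # (128,)
```

In this sketch, `consistency_weight` plays the fidelity-versus-controllability role described above; how the paper actually formulates and enforces its virtual consistency constraints may differ.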
📝 Abstract
Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency-based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.