🤖 AI Summary
Existing audio inpainting methods lose coherence and musicality when reconstructing long gaps (roughly 100 ms and above, up to 300 ms). To address this, the paper proposes a zero-shot conditional audio inpainting framework built on an unconditionally trained diffusion model: the unconditional model is conditioned at sampling time, without retraining, to fill gaps of any size. The method introduces a pitch-equivariant network architecture formulated in the Constant-Q Transform (CQT) domain, which models pitch structure explicitly and thereby improves time-frequency consistency and musical naturalness in long-gap reconstruction. Experiments across gap lengths of 50–300 ms show performance comparable to the baselines for short gaps (e.g., 50 ms), and superior performance on both objective metrics and a formal listening test for wider gaps; notably, for 300 ms gaps the approach yields better perceptual quality while preserving fidelity and temporal continuity.
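The pitch-equivariance exploited by the CQT formulation follows from the transform's geometrically spaced frequency bins: with center frequencies f_k = f_min · 2^(k/B) (B bins per octave), a pitch shift of s bins multiplies every frequency by 2^(s/B), which is exactly a translation by s bins along the log-frequency axis. The following minimal numpy sketch illustrates this property; the specific values of `f_min`, `B`, and `n_bins` are illustrative choices, not parameters taken from the paper.

```python
import numpy as np

# Geometrically spaced CQT center frequencies: f_k = f_min * 2**(k / B),
# with B bins per octave (illustrative values: C1, semitone resolution).
f_min, B, n_bins = 32.70, 12, 84
k = np.arange(n_bins)
freqs = f_min * 2.0 ** (k / B)

# On a log2 axis the bins are uniformly spaced (every gap equals 1/B) ...
log_spacing = np.diff(np.log2(freqs))

# ... so a pitch shift by s bins is a pure translation along the bin axis:
s = 3
shifted = freqs * 2.0 ** (s / B)            # all frequencies pitch-shifted
translated = f_min * 2.0 ** ((k + s) / B)   # same bin grid, moved up s slots
# shifted and translated coincide, which is why a convolutional network
# over the CQT bin axis can share weights across pitches.
```

This is why a network that is translation-equivariant over the CQT's frequency axis (e.g., convolutional along bins) is automatically pitch-equivariant, a symmetry that a linear-frequency spectrogram does not offer.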
📝 Abstract
Audio inpainting aims to reconstruct missing segments in corrupted recordings. Most existing methods produce plausible reconstructions when gaps are short, but struggle to reconstruct gaps longer than about 100 ms. This paper explores recent advances in deep learning, and particularly diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting and can regenerate gaps of any size. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics on the task of reconstructing short to mid-sized gaps of up to 300 ms. The results of a formal listening test show that the proposed method performs comparably to the baselines for short gaps, such as 50 ms, while retaining good audio quality and outperforming the baselines for wider gaps of up to 300 ms. The method can be applied to restoring sound recordings that suffer from severe local disturbances or dropouts.
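The zero-shot conditioning described above can be sketched as a reverse-diffusion loop with a data-consistency step: at every step, the samples outside the gap are overwritten with a freshly noised copy of the observed signal, so the unconditional model only "invents" content inside the gap (in the spirit of RePaint-style imputation). The sketch below is a toy illustration under stated assumptions, not the paper's implementation: the `denoise` function is a hypothetical stand-in for the trained unconditional diffusion model, and the sampler is a single-step Euler update.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(x_t, sigma):
    # Hypothetical stand-in for the unconditional diffusion model's
    # denoiser; here it simply shrinks the noisy input toward zero.
    return x_t / (1.0 + sigma**2)

def zero_shot_inpaint(y, mask, sigmas):
    """Reverse diffusion with data consistency: mask == 1 marks the gap;
    known samples (mask == 0) are re-imposed from the noised observation
    at every step, so only the gap is generated."""
    x = sigmas[0] * rng.standard_normal(y.shape)
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x0_hat = denoise(x, sigma)
        d = (x - x0_hat) / sigma              # Euler step direction
        x = x + (sigma_next - sigma) * d      # one reverse step
        # Data consistency: overwrite the known region with a copy of
        # the observation noised to the current noise level.
        noised_obs = y + sigma_next * rng.standard_normal(y.shape)
        x = mask * x + (1.0 - mask) * noised_obs
    return x

# Toy signal with a masked-out gap.
t = np.linspace(0.0, 1.0, 256)
clean = np.sin(2.0 * np.pi * 5.0 * t)
mask = np.zeros_like(clean)
mask[100:130] = 1.0                 # 1 inside the gap
y_corrupt = clean * (1.0 - mask)    # observation with the gap zeroed
sigmas = np.geomspace(1.0, 1e-3, 30)
restored = zero_shot_inpaint(y_corrupt, mask, sigmas)
```

Outside the gap, the output matches the observation up to the final (small) noise level; inside the gap, the quality of the fill depends entirely on the learned denoiser, which is where the pitch-equivariant CQT architecture enters in the actual method.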