🤖 AI Summary
To address the challenge of jointly modeling diverse types and severities of distortions in speech restoration, this paper proposes an acoustic context-aware generative restoration method. The core innovation is the Acoustic Context (ACX) representation, which adaptively refines CLAP-based acoustic embeddings to better capture distortion characteristics, thereby reducing the reliance on linguistic or speaker-specific content inherent in conventional approaches. Built upon the UNIVERSE++ diffusion architecture, the method incorporates ACX embeddings as conditional guidance signals to enable environment-aware, end-to-end restoration. Experiments demonstrate substantial improvements across multiple noise and distortion conditions: PESQ increases by 1.2, STOI by 3.8%, and output variance decreases by 37%. The proposed approach thus achieves markedly improved restoration consistency and robustness.
📝 Abstract
This paper introduces a novel approach to speech restoration based on a context-related conditioning strategy. Specifically, we employ the diffusion-based generative restoration model UNIVERSE++ as a backbone to evaluate the effectiveness of contextual representations. We incorporate acoustic context embeddings extracted from the CLAP model, which capture the environmental attributes of the input audio. Additionally, we propose an Acoustic Context (ACX) representation that refines CLAP embeddings to better handle the various distortion factors in speech signals and their intensities. Unlike content-based approaches that rely on linguistic and speaker attributes, ACX provides contextual information that enables the restoration model to better distinguish and mitigate distortions. Experimental results indicate that context-aware conditioning improves both restoration performance and stability across diverse distortion conditions, reducing variability compared to content-based methods.
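The abstract does not specify how the ACX refinement or the conditioning is implemented. Purely as an illustrative sketch of the overall idea, the toy code below assumes a fixed-size CLAP-style audio embedding, a learned projection as the "refinement" step, and FiLM-style feature modulation as the conditioning mechanism; all function names, dimensions, and the FiLM choice are assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def refine_acx(clap_emb, W, b):
    # Hypothetical ACX refinement: a learned projection mapping the
    # CLAP audio embedding into a distortion-focused context vector.
    return np.tanh(clap_emb @ W + b)

def film_condition(features, acx, W_scale, W_shift):
    # FiLM-style conditioning (an assumption; the paper builds on
    # UNIVERSE++'s own conditioning pathway): the context vector
    # modulates intermediate features of the restoration network.
    scale = acx @ W_scale   # per-channel scale
    shift = acx @ W_shift   # per-channel shift
    return features * (1.0 + scale) + shift

# Toy dimensions: 512-d CLAP embedding, 128-d ACX, 64 feature channels.
clap_emb = rng.standard_normal(512)
W, b = rng.standard_normal((512, 128)) * 0.02, np.zeros(128)
W_scale = rng.standard_normal((128, 64)) * 0.02
W_shift = rng.standard_normal((128, 64)) * 0.02

acx = refine_acx(clap_emb, W, b)
features = rng.standard_normal(64)  # stand-in for a network activation
conditioned = film_condition(features, acx, W_scale, W_shift)
print(acx.shape, conditioned.shape)  # (128,) (64,)
```

In a real diffusion restorer the modulation would be applied inside each conditioning block at every denoising step; the sketch only shows the data flow from embedding to modulated features.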