Vision-Integrated High-Quality Neural Speech Coding

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of neural speech codecs in reconstructing high-fidelity speech and maintaining noise robustness at low bitrates. We propose the first end-to-end vision-guided neural speech coding framework. To enable audio-visual co-modeling without increasing bitrate (fixed at 16 kbps), our method employs dual mechanisms: explicit lip-feature fusion and implicit audio-visual knowledge distillation. The architecture comprises an image analysis-synthesis module, a switchable feature fusion module, and a joint audio-visual training framework, integrated with a lightweight neural vocoder. Experiments across multiple noisy conditions demonstrate significant improvements: PESQ increases by ≥1.2, and STOI improves by 3.5% over audio-only baseline models—achieving superior perceptual quality and intelligibility without requiring additional bandwidth.
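The implicit side of the dual mechanism above can be pictured as feature-level knowledge distillation: a teacher path that sees both audio and lip features guides an audio-only student path by matching intermediate representations. The toy "fusion" and loss below are illustrative stand-ins for the paper's learned modules, not its actual implementation.

```python
# Hedged sketch of implicit audio-visual knowledge distillation.
# The element-wise average "fusion" and MSE loss are placeholder choices;
# the paper's encoders and fusion layers are learned networks.

def fuse(audio_feat, visual_feat):
    """Toy audio-visual teacher feature: element-wise average of the streams."""
    return [(a + v) / 2.0 for a, v in zip(audio_feat, visual_feat)]

def distillation_loss(student_feat, teacher_feat):
    """MSE between student (audio-only) and teacher (audio-visual) features."""
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n

audio = [0.2, 0.5, 0.1]   # placeholder audio encoder output for one frame
lips = [0.4, 0.3, 0.3]    # placeholder lip-image feature for the same frame
teacher = fuse(audio, lips)
loss = distillation_loss(audio, teacher)  # pulls the audio-only path toward A/V features
```

Minimizing such a loss during training is what lets the codec benefit from vision even when no lip video is available at inference time, which is how the paper keeps the bitrate fixed.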

📝 Abstract
This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
Problem

Research questions and friction points this paper is trying to address.

Enhance speech coding quality using visual modality information
Integrate lip image features to assist speech coding process
Improve decoded speech quality and noise robustness without increasing bitrate
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-integrated neural speech codec (VNSC)
Extracts visual features from lip images
Explicit integration or implicit distillation strategies
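The "switchable" behavior in the bullets above can be sketched as a single branch at inference time: fuse lip features into the coding path when they are available (explicit integration), otherwise fall back to the audio-only representation trained via distillation (implicit). The additive fusion and all names here are assumptions standing in for the paper's learned fusion module.

```python
# Minimal sketch of switchable feature fusion, assuming one acoustic
# feature vector per frame. Additive fusion is a placeholder for the
# paper's learned fusion layer.

def switchable_fusion(audio_feat, visual_feat=None):
    """Return the features passed on to the speech coding module.

    With lip features available, fuse them explicitly; without them,
    use the audio-only path that distillation trained to mimic the
    audio-visual one.
    """
    if visual_feat is None:
        return list(audio_feat)  # implicit path: audio-only, distillation-trained
    # explicit path: element-wise addition stands in for learned fusion
    return [a + v for a, v in zip(audio_feat, visual_feat)]

with_vision = switchable_fusion([0.1, 0.2], [0.3, 0.4])  # explicit integration
audio_only = switchable_fusion([0.1, 0.2])               # implicit fallback
```

Either branch produces features of the same shape, which is consistent with the paper's claim that visual guidance adds no bits to the transmitted stream.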
Yao Guo
Beijing Institute of Technology
Nanodevices
Yang Ai
Associate Researcher, University of Science and Technology of China
Speech Synthesis, Speech Enhancement, Speech Coding, Deep Learning
Rui-Chen Zheng
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Hui-Peng Du
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Xiao-Hang Jiang
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China
Zhenhua Ling
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P. R. China