Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

📅 2025-08-11
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of deploying audio-visual speech enhancement (AVSE) across heterogeneous network environments (Ethernet, Wi-Fi, 4G/5G) and distributed architectures (cloud, edge-assisted, on-device). We propose a lightweight multimodal AVSE system based on a hybrid CNN-LSTM architecture that jointly models spectral acoustic features and temporal dynamics, while supporting cross-platform model partitioning and adaptive inference scheduling. Our key contribution is the first systematic characterization of the latency–quality trade-off in edge-cooperative AVSE: we identify the optimal operational point achieving end-to-end latency <200 ms and substantial intelligibility gains (STOI improvement ≥0.15). Under 5G/Wi-Fi 6, the edge-assisted mode strikes the best balance—outperforming cloud-only (high quality but high latency) and on-device-only (low latency but limited performance) configurations. This provides a practical, empirically grounded deployment paradigm and architectural selection guideline for real-time applications such as hearing aids and remote conferencing.
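The hybrid architecture described above can be sketched as a toy forward pass: a small 1-D convolution extracts spectral features per audio frame, these are concatenated with a per-frame visual embedding (multimodal fusion), an LSTM models the temporal dynamics, and a sigmoid head emits a spectral gain mask. This is a minimal illustrative sketch, not the paper's implementation; all layer sizes, the random weights, and the `avse_mask` helper are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_features(frame, kernels):
    """Per-frame spectral feature extraction: 1-D convolution over
    frequency bins, ReLU, then mean pooling per channel."""
    feats = []
    for k in kernels:
        resp = np.convolve(frame, k, mode="valid")
        feats.append(np.maximum(resp, 0.0).mean())
    return np.array(feats)

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM cell update (input, forget, cell, output gates)."""
    z = Wx @ x + Wh @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)
    h = o * np.tanh(c)
    return h, c

def avse_mask(noisy_spec, visual_emb, n_ch=8, hidden=16):
    """Toy AVSE forward pass with randomly initialized weights
    (an untrained stand-in for the trained CNN-LSTM model)."""
    n_frames, n_freq = noisy_spec.shape
    d_vis = visual_emb.shape[1]
    kernels = [rng.standard_normal(5) * 0.1 for _ in range(n_ch)]
    Wx = rng.standard_normal((4 * hidden, n_ch + d_vis)) * 0.1
    Wh = rng.standard_normal((4 * hidden, hidden)) * 0.1
    b = np.zeros(4 * hidden)
    W_out = rng.standard_normal((n_freq, hidden)) * 0.1

    h, c = np.zeros(hidden), np.zeros(hidden)
    mask = np.empty((n_frames, n_freq))
    for t in range(n_frames):
        a = conv_features(noisy_spec[t], kernels)  # spectral cues (CNN)
        x = np.concatenate([a, visual_emb[t]])     # multimodal fusion
        h, c = lstm_step(x, h, c, Wx, Wh, b)       # temporal modeling (LSTM)
        mask[t] = sigmoid(W_out @ h)               # per-bin gain mask in [0, 1]
    return mask

spec = np.abs(rng.standard_normal((50, 129)))  # 50 frames x 129 frequency bins
vis = rng.standard_normal((50, 4))             # e.g. a lip-region embedding
mask = avse_mask(spec, vis)
enhanced = mask * spec                          # mask applied to noisy magnitudes
```

In a real system the mask would be applied to the noisy STFT magnitudes and inverted back to a waveform; here the point is only the data flow: per-frame CNN features, fusion by concatenation, recurrent state across frames.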

📝 Abstract
This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
Problem

Research questions and friction points this paper is trying to address.

Develops AI-based AVSE system for robust speech enhancement
Compares deployment architectures for latency and quality trade-offs
Evaluates performance in real-world network conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN and LSTM for multimodal speech enhancement
Cloud, edge, and standalone deployment analysis
Real-time optimization under 5G and Wi-Fi 6
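The architecture-selection guideline above amounts to a constrained choice: among deployment modes whose end-to-end latency (network round trip plus compute) fits the real-time budget, pick the one with the highest enhancement quality. A minimal sketch follows; the latency and STOI-gain numbers are illustrative placeholders, not measurements from the paper, and `select_architecture` is a hypothetical helper.

```python
def select_architecture(candidates, budget_ms=200.0):
    """Pick the deployment mode with the highest quality gain among those
    meeting the latency budget; fall back to the lowest-latency mode if
    none is feasible. candidates: name -> (end_to_end_ms, quality_gain)."""
    feasible = {name: q for name, (lat, q) in candidates.items() if lat <= budget_ms}
    if feasible:
        return max(feasible, key=feasible.get)
    return min(candidates, key=lambda name: candidates[name][0])

# Illustrative numbers only: (network RTT + compute in ms, STOI gain).
over_5g = {
    "cloud":  (250.0, 0.20),   # best quality, but RTT-dominated
    "edge":   (120.0, 0.17),
    "device": (35.0, 0.10),    # lowest latency, limited model capacity
}
congested_4g = {
    "cloud":  (420.0, 0.20),
    "edge":   (260.0, 0.17),
    "device": (35.0, 0.10),
}

print(select_architecture(over_5g))        # edge-assisted wins within budget
print(select_architecture(congested_4g))   # only on-device stays real-time
```

Under the illustrative 5G numbers the selector reproduces the paper's qualitative conclusion: cloud misses the 200 ms budget, so the edge-assisted mode wins on quality among feasible options, while on degraded links only on-device inference remains real-time.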