🤖 AI Summary
This work addresses the challenge of deploying audio-visual speech enhancement (AVSE) across heterogeneous network environments (Ethernet, Wi-Fi, 4G/5G) and distributed architectures (cloud, edge-assisted, on-device). We propose a lightweight multimodal AVSE system based on a hybrid CNN-LSTM architecture, in which CNNs extract spectral acoustic features and LSTMs model temporal dynamics, and which supports cross-platform model partitioning and adaptive inference scheduling. Our key contribution is the first systematic characterization of the latency–quality trade-off in edge-cooperative AVSE: we identify an operating point that achieves end-to-end latency below 200 ms together with a substantial intelligibility gain (STOI improvement ≥0.15). Under 5G and Wi-Fi 6, the edge-assisted mode strikes the best balance, outperforming both the cloud-only configuration (high quality but high latency) and the on-device-only configuration (low latency but limited performance). These results provide an empirically grounded guideline for choosing a deployment architecture in real-time applications such as hearing aids and remote conferencing.
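The selection rule implied by these two thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the `ModeMeasurement` type, the `select_mode` function, and the tie-breaking rule are assumptions, and the example numbers are placeholders rather than results from the paper. Only the 200 ms and 0.15 thresholds come from the summary.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModeMeasurement:
    """One deployment mode measured under one network condition (hypothetical type)."""
    mode: str          # e.g. "cloud", "edge", "device"
    latency_ms: float  # measured end-to-end latency
    stoi_gain: float   # STOI improvement over the unprocessed noisy input


def select_mode(
    measurements: Iterable[ModeMeasurement],
    max_latency_ms: float = 200.0,  # latency bound quoted in the summary
    min_stoi_gain: float = 0.15,    # STOI gain bound quoted in the summary
) -> Optional[ModeMeasurement]:
    """Return the feasible mode with the largest STOI gain, or None if none qualifies.

    Ties on STOI gain are broken in favour of lower latency (an assumed rule).
    """
    feasible = [
        m for m in measurements
        if m.latency_ms < max_latency_ms and m.stoi_gain >= min_stoi_gain
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda m: (m.stoi_gain, -m.latency_ms))


# Placeholder numbers for illustration only; they are NOT results from the paper.
example = [
    ModeMeasurement("cloud", latency_ms=320.0, stoi_gain=0.21),
    ModeMeasurement("edge", latency_ms=140.0, stoi_gain=0.18),
    ModeMeasurement("device", latency_ms=60.0, stoi_gain=0.09),
]

best = select_mode(example)
print(best.mode if best else "no mode meets both thresholds")
```

With these placeholder values the cloud mode fails the latency bound and the on-device mode fails the STOI bound, so the sketch returns the edge-assisted mode, mirroring the qualitative pattern the summary describes.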
📝 Abstract
This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
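To make the split between processing delay and communication latency concrete, the following Python sketch adds the two components for a single audio-visual chunk. It is a simplified first-order model under stated assumptions, not the measurement method used in the paper: the function names, the payload sizes, and the decision to ignore queueing, jitter, buffering, and protocol overhead are all illustrative choices.

```python
def transfer_ms(num_bytes: int, link_mbps: float) -> float:
    """Time in milliseconds to push num_bytes over a link of link_mbps megabits per second."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    # bits / (bits per second) gives seconds; multiply by 1000 for milliseconds.
    return num_bytes * 8 / (link_mbps * 1000.0)


def end_to_end_latency_ms(
    inference_ms: float,
    rtt_ms: float = 0.0,
    uplink_bytes: int = 0,
    downlink_bytes: int = 0,
    uplink_mbps: float = 1.0,
    downlink_mbps: float = 1.0,
) -> float:
    """Processing delay plus communication latency for one chunk (assumed additive model)."""
    communication_ms = (
        rtt_ms
        + transfer_ms(uplink_bytes, uplink_mbps)
        + transfer_ms(downlink_bytes, downlink_mbps)
    )
    return inference_ms + communication_ms


# On-device: no network terms, so latency equals the inference time.
on_device = end_to_end_latency_ms(inference_ms=30.0)

# Offloaded (cloud or edge): hypothetical payload sizes and link rates.
offloaded = end_to_end_latency_ms(
    inference_ms=30.0,
    rtt_ms=20.0,
    uplink_bytes=20_000,
    downlink_bytes=5_000,
    uplink_mbps=10.0,
    downlink_mbps=20.0,
)
print(on_device, offloaded)
```

In this model an offloaded deployment pays off only when the quality gained from a larger remote model outweighs the added round-trip and transfer time, which is the trade-off the experiments examine across Ethernet, Wi-Fi, 4G, and 5G.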