Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

📅 2025-08-11

🤖 AI Summary

This work addresses the challenge of deploying audio-visual speech enhancement (AVSE) across heterogeneous network environments (Ethernet, Wi-Fi, 4G/5G) and distributed architectures (cloud, edge-assisted, on-device). We propose a lightweight multimodal AVSE system based on a hybrid CNN-LSTM architecture that jointly models spectral acoustic features and temporal dynamics, while supporting cross-platform model partitioning and adaptive inference scheduling. Our key contribution is the first systematic characterization of the latency–quality trade-off in edge-cooperative AVSE: we identify an operating point that achieves end-to-end latency below 200 ms together with a substantial intelligibility gain (STOI improvement ≥0.15). Under 5G and Wi-Fi 6, the edge-assisted mode strikes the best balance, outperforming both the cloud-only configuration (high quality but high latency) and the on-device-only configuration (low latency but limited performance). These results provide a practical, empirically grounded deployment paradigm and architecture selection guideline for real-time applications such as hearing aids and remote conferencing.
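The operating point quoted above (latency below 200 ms, STOI improvement of at least 0.15) can be read as a simple feasibility filter over measured configurations. The sketch below is only an illustration of that reading, not the authors' selection procedure: the `Measurement` type, the `pick_mode` tie-breaking rule, and every number in `SAMPLE` are hypothetical placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the summary above.
MAX_LATENCY_MS = 200.0  # end-to-end latency must be strictly below this
MIN_STOI_GAIN = 0.15    # STOI improvement must be at least this


@dataclass(frozen=True)
class Measurement:
    """One measured configuration (hypothetical record layout)."""
    mode: str          # "cloud", "edge", or "device"
    network: str       # e.g. "5G", "Wi-Fi 6"
    latency_ms: float  # measured end-to-end latency
    stoi_gain: float   # STOI(enhanced) - STOI(noisy)


def meets_operating_point(m: Measurement) -> bool:
    """True if the configuration satisfies both thresholds."""
    return m.latency_ms < MAX_LATENCY_MS and m.stoi_gain >= MIN_STOI_GAIN


def pick_mode(measurements: List[Measurement]) -> Optional[Measurement]:
    """Among feasible configurations, prefer higher STOI gain, then lower latency.

    The tie-breaking rule is an assumption made for this sketch.
    """
    feasible = [m for m in measurements if meets_operating_point(m)]
    if not feasible:
        return None
    return max(feasible, key=lambda m: (m.stoi_gain, -m.latency_ms))


# Placeholder values for illustration only, NOT results from the paper.
SAMPLE = [
    Measurement("cloud", "5G", 260.0, 0.21),
    Measurement("edge", "5G", 150.0, 0.17),
    Measurement("device", "5G", 60.0, 0.09),
]
```

With these placeholder numbers, `pick_mode(SAMPLE)` returns the edge-assisted entry: the cloud entry exceeds the latency bound and the on-device entry falls short of the STOI threshold.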

📝 Abstract

This paper introduces a new AI-based Audio-Visual Speech Enhancement (AVSE) system and presents a comparative performance analysis of different deployment architectures. The proposed AVSE system employs convolutional neural networks (CNNs) for spectral feature extraction and long short-term memory (LSTM) networks for temporal modeling, enabling robust speech enhancement through multimodal fusion of audio and visual cues. Multiple deployment scenarios are investigated, including cloud-based, edge-assisted, and standalone device implementations. Their performance is evaluated in terms of speech quality improvement, latency, and computational overhead. Real-world experiments are conducted across various network conditions, including Ethernet, Wi-Fi, 4G, and 5G, to analyze the trade-offs between processing delay, communication latency, and perceptual speech quality. The results show that while cloud deployment achieves the highest enhancement quality, edge-assisted architectures offer the best balance between latency and intelligibility, meeting real-time requirements under 5G and Wi-Fi 6 conditions. These findings provide practical guidelines for selecting and optimizing AVSE deployment architectures in diverse applications, including assistive hearing devices, telepresence, and industrial communications.
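The trade-off between processing delay and communication latency described in the abstract can be made concrete with a simple additive budget. The sketch below assumes end-to-end latency is the sum of a frame-buffering delay, a compute delay, and a network round trip; this additive model, the function names, and all values in the usage note are assumptions for illustration, since the abstract does not give a latency formula.

```python
from typing import Dict


def end_to_end_latency_ms(compute_ms: float,
                          rtt_ms: float = 0.0,
                          buffer_ms: float = 0.0) -> float:
    """Additive latency model (an assumption of this sketch).

    buffer_ms  : time spent collecting an audio/video frame before processing
    compute_ms : inference time on whichever machine runs the model
    rtt_ms     : network round trip; zero for standalone on-device processing
    """
    return buffer_ms + compute_ms + rtt_ms


def compare_modes(compute_ms: Dict[str, float],
                  rtt_ms: Dict[str, float],
                  buffer_ms: float) -> Dict[str, float]:
    """Return the modelled end-to-end latency for each deployment mode.

    Modes missing from rtt_ms are treated as having no network hop.
    """
    return {
        mode: end_to_end_latency_ms(compute_ms[mode],
                                    rtt_ms.get(mode, 0.0),
                                    buffer_ms)
        for mode in compute_ms
    }
```

As a usage example with made-up inputs, `compare_modes({"cloud": 20.0, "edge": 40.0, "device": 90.0}, {"cloud": 120.0, "edge": 15.0}, 32.0)` gives 172.0 ms for cloud, 87.0 ms for edge, and 122.0 ms for device. It shows how a fast remote model can still lose to a slower nearby one once the round trip is counted.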

Problem

Research questions and friction points this paper is trying to address.

Develops an AI-based AVSE system for robust speech enhancement

Compares deployment architectures for latency and quality trade-offs

Evaluates performance in real-world network conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN and LSTM for multimodal speech enhancement

Cloud, edge, and standalone deployment analysis

Real-time optimization under 5G and Wi-Fi 6
