🤖 AI Summary
This work addresses the limited robustness of multimodal fusion in silent speech synthesis when sensors fail or modalities degrade. To this end, the authors propose a multimodal speech synthesis framework based on cross-modal masking that takes surface electromyography (sEMG) and lipreading video as its two input modalities. By dynamically masking either modality during training, the model is encouraged to learn complementary representations across modalities. The approach integrates multispeaker modeling within an end-to-end synthesis architecture. Experimental results show that this strategy improves generalization under missing-modality and low-bitrate conditions, reducing word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline in multispeaker settings, with particularly strong improvements for vowels and specific consonant groups.
📝 Abstract
Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results show that masking strategies are critical both for these performance gains and for robustness under low-bitrate conditions, and that they generalize better than degradation-specific data augmentations when a modality is absent. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
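To make the idea of modality masking during training concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how masking is applied, so everything here is an assumption: the function name `mask_modalities`, the choice to zero out an entire modality for a training example, the masking probability `p_mask`, and the rule that at most one modality is masked at a time so the model always receives some input.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical feature layout: a list of time steps, each a list of feature values.
Frames = List[List[float]]


def zero_like(frames: Frames) -> Frames:
    """Return an all-zero copy with the same shape as `frames`."""
    return [[0.0] * len(frame) for frame in frames]


def mask_modalities(
    emg: Frames,
    video: Frames,
    p_mask: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[Frames, Frames, str]:
    """Randomly zero out at most one modality for a single training example.

    Assumed scheme (not taken from the paper): with probability `p_mask`
    the sEMG stream is masked, with probability `p_mask` the video stream
    is masked, and otherwise both are kept. The two streams are never
    masked together.
    """
    if not 0.0 <= p_mask <= 0.5:
        raise ValueError("p_mask must lie in [0, 0.5]")
    if rng is None:
        rng = random.Random()

    draw = rng.random()
    if draw < p_mask:
        return zero_like(emg), video, "emg"
    if draw < 2 * p_mask:
        return emg, zero_like(video), "video"
    return emg, video, "none"


if __name__ == "__main__":
    # Toy inputs: 2 time steps of 2-dimensional sEMG and 1-dimensional video features.
    emg_example = [[1.0, 2.0], [3.0, 4.0]]
    video_example = [[5.0], [6.0]]
    masked_emg, masked_video, which = mask_modalities(
        emg_example, video_example, rng=random.Random(0)
    )
    print(which, masked_emg, masked_video)
```

In a training loop, a function of this kind would be applied per example before the fusion model sees the inputs, so that the model is regularly forced to synthesize speech from a single modality. Masking shorter time spans or feature channels instead of whole streams would be an equally plausible variant.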