AI Summary
This work addresses the challenge of reconstructing intelligible continuous speech from non-invasive neural signals, which are noisy, spatially blurred, and carry incomplete linguistic information. To overcome these limitations, the authors propose MindVoice, a framework that, for the first time, leverages pretrained priors to decouple speech reconstruction into two complementary pathways: semantic content recovery and acoustic attribute estimation. By combining the outputs of these pathways with pretrained speech generation models and in-context voice cloning, MindVoice generates speech that is both intelligible and natural-sounding. Evaluated on both EEG and MEG datasets, the method substantially outperforms existing approaches on metrics including speech intelligibility and naturalness.
Abstract
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, because non-invasive recordings are inherently noisy, spatially blurred, and preserve only part of the information about perceived speech. Existing methods map neural activity directly to entangled speech representations and then synthesize waveforms with neural vocoders, which yields output that is spectrally similar to the target but unintelligible. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. The inferred representations are then combined with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG data demonstrate that MindVoice substantially outperforms existing methods across multiple metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, pointing to a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.