🤖 AI Summary
State-of-the-art pretrained speech models achieve strong task performance, yet their representational hierarchy is misaligned with human auditory processing: middle layers are semantically rich, whereas late layers lack semantic specificity.
Method: The study builds on brain-tuning, i.e., fine-tuning self-supervised speech models (e.g., wav2vec 2.0) with human neural recordings (fMRI and EEG), and is the first to test whether brain-tuned models align with the intermediate stages of human cortical speech processing. Layer-wise probing and Representational Similarity Analysis (RSA) are employed for validation.
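As a concrete illustration of the RSA step, here is a minimal sketch that compares the representational geometry of one model layer with that of a brain region of interest. The data shapes, random inputs, and variable names are illustrative assumptions, not the paper's actual pipeline:

```python
# Minimal RSA sketch: compare a model layer's representational geometry
# with a brain region's geometry over the same stimulus set.
# Hypothetical shapes; a real pipeline would use time-aligned stimulus
# features and preprocessed fMRI responses.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 60

layer_features = rng.normal(size=(n_stimuli, 768))   # one model layer, e.g. wav2vec 2.0
brain_responses = rng.normal(size=(n_stimuli, 500))  # voxels in a semantic ROI

# Representational dissimilarity matrices (condensed form): one
# correlation-distance value per stimulus pair.
model_rdm = pdist(layer_features, metric="correlation")
brain_rdm = pdist(brain_responses, metric="correlation")

# RSA score: rank correlation between the two RDMs.
rsa_score, _ = spearmanr(model_rdm, brain_rdm)
print(f"RSA (Spearman) between layer and ROI: {rsa_score:.3f}")
```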
Results: Brain-tuned fine-tuning substantially improves functional specialization across layers: early layers remain dedicated to low-level acoustic features, while late layers robustly encode high-level semantics, establishing a clear "acoustic → semantic" hierarchical progression. Crucially, late-layer representations show markedly improved functional alignment with canonical semantic brain regions. Beyond performance gains, the approach yields a biologically plausible, interpretable, and empirically testable speech-processing hierarchy, giving computational neuroscience a neuroscientifically grounded modeling paradigm.
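A common recipe for this kind of layer-wise brain-alignment result is a ridge-regression encoding model scored by held-out correlation with voxel responses. The sketch below assumes synthetic features and responses and is not the authors' exact procedure:

```python
# Layer-wise brain-alignment sketch: predict voxel responses from each
# layer's features with ridge regression, then score held-out correlation.
# All names and shapes are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_voxels, n_layers, dim = 200, 300, 12, 768

# Per-layer features for the same stimuli (stand-ins for model activations).
layer_feats = [rng.normal(size=(n_samples, dim)) for _ in range(n_layers)]
voxels = rng.normal(size=(n_samples, n_voxels))  # e.g., semantic-region responses

for i, X in enumerate(layer_feats):
    X_tr, X_te, y_tr, y_te = train_test_split(X, voxels, test_size=0.25, random_state=0)
    pred = Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te)
    # Alignment score: mean per-voxel Pearson correlation on held-out data.
    r = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
    print(f"layer {i:2d}: mean held-out r = {np.nanmean(r):.3f}")
```

Layers whose features best predict responses in semantic regions are taken to be the most semantically aligned, which is how a late-layer improvement after brain-tuning would show up.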
📝 Abstract
Pretrained self-supervised speech models excel in speech tasks but do not reflect the hierarchy of human speech processing, as they encode rich semantics in middle layers and poor semantics in late layers. Recent work showed that brain-tuning (fine-tuning models using human brain recordings) improves speech models' semantic understanding. Here, we examine how well brain-tuned models further reflect the brain's intermediate stages of speech processing. We find that late layers of brain-tuned models substantially improve over pretrained models in their alignment with semantic language regions. Further layer-wise probing reveals that early layers remain dedicated to low-level acoustic features, while late layers become the best at complex high-level tasks. These findings show that brain-tuned models not only perform better but also exhibit a well-defined hierarchical processing going from acoustic to semantic representations, making them better model organisms for human speech processing.
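To make the layer-wise probing idea concrete, the following hedged sketch extracts per-layer hidden states from a pretrained wav2vec 2.0 (via Hugging Face transformers) and fits a simple linear probe per layer. The synthetic audio and binary labels are placeholders for a real probing task such as phoneme or word identification:

```python
# Layer-wise probing sketch: per-layer representations from wav2vec 2.0,
# each scored by a linear probe. Synthetic audio/labels keep the example
# self-contained; a real probe would use a labeled speech task.
import numpy as np
import torch
from transformers import Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

rng = np.random.default_rng(0)
n_clips = 40
audio = torch.tensor(rng.normal(size=(n_clips, 16000)), dtype=torch.float32)  # 1 s @ 16 kHz
labels = rng.integers(0, 2, size=n_clips)  # placeholder binary task

with torch.no_grad():
    out = model(audio, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each (batch, frames, dim).
for i, h in enumerate(out.hidden_states):
    feats = h.mean(dim=1).numpy()  # mean-pool over time frames
    acc = LogisticRegression(max_iter=1000).fit(feats, labels).score(feats, labels)
    print(f"layer {i:2d}: probe accuracy = {acc:.2f}")
```

Comparing probe accuracy across layers, before and after brain-tuning, is what reveals the acoustic-to-semantic progression the abstract describes: low-level tasks peak early, high-level tasks peak late.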