Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior multi-speaker MRI-based speech studies were hindered by inter-subject variability, limiting fine-grained modeling of acoustic-articulatory mappings. Method: Leveraging a long-duration, real-time MRI speech corpus from a single speaker, we developed audio-only, video-only, and multimodal phoneme recognition models. We integrated deep neural networks, multimodal fusion, latent-space visualization, and attention mechanism analysis to systematically disentangle the independent temporal contributions of acoustic and articulatory modalities to phoneme recognition. Contribution/Results: Audio and multimodal models achieve comparable performance in manner-of-articulation classification but diverge significantly in place-of-articulation classification. Latent representations exhibit highly consistent phonological structure across modalities, with strong cross-modal semantic alignment yet markedly distinct attention dynamics. This work establishes a novel paradigm for modality-specific modeling of speech production mechanisms, enabling precise characterization of how acoustic and articulatory signals differentially support phonemic perception.

📝 Abstract
Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.
Problem

Research questions and friction points that this paper addresses.

Disentangling articulation and acoustics in phoneme recognition
Overcoming multi-speaker variability in MRI-based speech studies
Interpreting modality contributions in audio-visual phoneme models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses single-speaker MRI corpus
Develops unimodal and multimodal models
Analyzes phonetic space via latent encoding
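The paper develops separate audio and video phoneme recognizers and a fused multimodal model, but this page does not specify the fusion scheme. As one common baseline, a late-fusion sketch combines the per-modality posteriors by a weighted average; the phoneme set, logit values, and weighting below are purely illustrative, not taken from the paper.

```python
import math

PHONEMES = ["p", "t", "k", "a", "i"]  # toy label set, not the paper's inventory

def softmax(logits):
    # Numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def late_fuse(audio_logits, video_logits, w_audio=0.5):
    """Weighted average of per-modality posteriors (generic late fusion)."""
    pa = softmax(audio_logits)
    pv = softmax(video_logits)
    return [w_audio * a + (1 - w_audio) * v for a, v in zip(pa, pv)]

# Hypothetical per-frame logits from unimodal audio and video models
audio_logits = [2.0, 0.1, 0.1, 0.5, 0.2]
video_logits = [0.3, 1.8, 0.2, 0.4, 0.1]

fused = late_fuse(audio_logits, video_logits)
pred = PHONEMES[max(range(len(fused)), key=fused.__getitem__)]
print(pred)  # audio favors "p", video favors "t"; fusion arbitrates
```

Comparing such a fused model against the unimodal branches is one way to attribute performance differences (e.g. on place vs. manner of articulation) to a specific modality.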
Sean Foley
Macquarie University
Hong Nguyen
PhD Student at University of Southern California
Jihwan Lee
Signal Analysis and Interpretation Laboratory, University of Southern California, USA
Sudarsana Reddy Kadiri
University of Southern California
Dani Byrd
Department of Linguistics, University of Southern California, USA
Louis Goldstein
Department of Linguistics, University of Southern California, USA
Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory, University of Southern California, USA