🤖 AI Summary
To address the high dimensionality and limited interpretability of vocal tract representations in articulatory inversion and speech synthesis, this paper proposes ARTI-6, a six-dimensional articulatory speech coding framework grounded in real-time MRI data that characterizes the dynamic movements of key vocal tract regions, including the velum, tongue root, and larynx. Methodologically, the authors design a physiologically interpretable and computationally efficient low-dimensional encoding; leverage speech foundation models for articulatory inversion; and build models that map between acoustics and articulation in both directions. Experiments show an articulatory inversion correlation of 0.87 and demonstrate that intelligible, natural-sounding speech can be reconstructed from only six articulatory dimensions, supporting the sufficiency of this ultra-low-dimensional representation. The code and speech samples are publicly released, providing a foundation for physiological speech modeling, cross-modal speech generation, and clinical speech technologies.
📝 Abstract
We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
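The abstract describes a two-model pipeline around a six-dimensional articulatory code: inversion (acoustics → articulatory features) and synthesis (articulatory features → speech). A minimal toy sketch of that pipeline shape, using a least-squares stand-in for the real foundation-model-based inversion (all data, dimensions other than the 6-D code, and the linear model here are hypothetical illustrations, not the paper's implementation):

```python
import numpy as np

# Toy sketch of the ARTI-6 pipeline shape: per-frame acoustic features are
# "inverted" to a 6-dimensional articulatory vector. The linear model and
# mock data below are stand-ins, not the paper's method.
rng = np.random.default_rng(0)

N_FRAMES, ACOUSTIC_DIM, ARTI_DIM = 200, 80, 6  # ARTI_DIM = 6 per the paper

# Mock data: acoustic features and a noisy linear articulatory trace.
acoustics = rng.standard_normal((N_FRAMES, ACOUSTIC_DIM))
true_map = rng.standard_normal((ACOUSTIC_DIM, ARTI_DIM))
articulatory = acoustics @ true_map + 0.1 * rng.standard_normal((N_FRAMES, ARTI_DIM))

# "Inversion": least-squares fit standing in for the speech-foundation-model
# predictor described in the abstract.
W, *_ = np.linalg.lstsq(acoustics, articulatory, rcond=None)
pred = acoustics @ W

# Evaluate as the paper does, via correlation between predicted and true
# articulatory trajectories (the paper reports 0.87 for its real model;
# this toy fit scores near 1.0 only because the mock data is linear).
corrs = [np.corrcoef(pred[:, d], articulatory[:, d])[0, 1] for d in range(ARTI_DIM)]
print(round(float(np.mean(corrs)), 3))
```

The synthesis direction would analogously map the six-dimensional trace back to a waveform; the point of the sketch is only the data-flow shape, with a 6-D bottleneck between the two models.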