ARTI-6: Towards Six-dimensional Articulatory Speech Encoding

📅 2025-09-25
🤖 AI Summary
To address the high dimensionality and limited interpretability of vocal tract modeling in articulatory inversion and speech synthesis, this paper proposes ARTI-6, a six-dimensional articulatory speech encoding framework grounded in real-time MRI data that characterizes the dynamics of key vocal tract regions, including the velum (soft palate), tongue root, and larynx. Methodologically, the authors design a physiologically interpretable, computationally efficient low-dimensional encoding scheme; apply speech foundation models to articulatory inversion; and build an end-to-end acoustic–articulatory bidirectional mapping model. Experiments report an articulatory inversion correlation of 0.87 and show that intelligible, high-fidelity speech can be reconstructed from only six latent dimensions, supporting the sufficiency of an ultra-low-dimensional representation. The code and speech samples are publicly released, supporting physiological speech modeling, cross-modal speech generation, and clinical speech technologies.
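The 0.87 figure is an inversion correlation between predicted and ground-truth articulatory trajectories. Such scores are commonly computed as a Pearson correlation per articulatory dimension, averaged across dimensions; a minimal sketch of that metric (the function name and averaging scheme are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def mean_pearson(pred: np.ndarray, target: np.ndarray) -> float:
    """Average Pearson correlation across articulatory dimensions.

    pred, target: arrays of shape (T, D) -- T time frames,
    D articulatory dimensions (D = 6 for ARTI-6).
    """
    corrs = []
    for d in range(pred.shape[1]):
        p = pred[:, d] - pred[:, d].mean()
        t = target[:, d] - target[:, d].mean()
        # Pearson r = covariance / product of standard deviations;
        # small epsilon guards against constant (zero-variance) trajectories.
        corrs.append(float((p @ t) / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-8)))
    return float(np.mean(corrs))
```

Because Pearson correlation is invariant to per-dimension scale and offset, it rewards recovering the shape of each articulator trajectory rather than its absolute units.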

📝 Abstract
We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
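The inversion component maps frame-level acoustic features from a speech foundation model to the six articulatory dimensions. A common baseline for this kind of mapping, and a useful mental model, is a linear probe: ridge regression from precomputed encoder features to the articulatory targets. The sketch below is that baseline under stated assumptions, not the paper's actual model (which is an end-to-end neural mapping); feature dimensions and the regularizer `lam` are illustrative:

```python
import numpy as np

def fit_linear_inversion(feats: np.ndarray, artics: np.ndarray,
                         lam: float = 1e-3) -> np.ndarray:
    """Ridge regression from acoustic features to articulatory targets.

    feats:  (T, F) frame-level features, e.g. from a frozen speech
            foundation model (assumption -- any frame encoder works).
    artics: (T, 6) articulatory trajectories (the six ARTI-6 dimensions).
    Returns the (F + 1, 6) weight matrix, last row being the bias.
    """
    F = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
    # Closed-form ridge solution: (F^T F + lam I)^-1 F^T Y
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ artics)

def predict(W: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Apply a fitted inversion probe to new frame features."""
    F = np.hstack([feats, np.ones((len(feats), 1))])
    return F @ W
```

Evaluating such a probe with a per-dimension correlation metric is how headline numbers like the paper's 0.87 are typically obtained; the gap between a linear probe and an end-to-end model indicates how much of the articulatory information is linearly accessible in the foundation-model features.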
Problem

Research questions and friction points this paper is trying to address.

Develops a compact six-dimensional articulatory speech encoding framework
Predicts articulatory features from speech acoustics using foundation models
Reconstructs intelligible speech directly from low-dimensional articulatory features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Six-dimensional articulatory speech encoding from MRI data
Articulatory inversion predicts features from speech acoustics
Articulatory synthesis reconstructs speech from articulatory features
Jihwan Lee
Signal Analysis and Interpretation Lab, University of Southern California
Sean Foley
Macquarie University
Thanathai Lertpetchpun
Signal Analysis and Interpretation Lab, University of Southern California
Kevin Huang
Signal Analysis and Interpretation Lab, University of Southern California
Yoonjeong Lee
Signal Analysis and Interpretation Lab, University of Southern California
Tiantian Feng
Postdoc Researcher
Louis Goldstein
Department of Linguistics, University of Southern California
Dani Byrd
Department of Linguistics, University of Southern California
Shrikanth Narayanan
Signal Analysis and Interpretation Lab, University of Southern California