🤖 AI Summary
A publicly available, long-duration, single-speaker real-time MRI speech dataset, together with standardized benchmarks for speech analysis, has been lacking. Method: We introduce the USC Long Single-Speaker (LSS) dataset, comprising roughly one hour of high-temporal-resolution (100 fps) dynamic vocal tract MRI video synchronized with high-fidelity audio. Alongside the raw recordings, the release includes MRI video cropped to the vocal tract region, sentence-level segmentation, restored and denoised audio, and temporally aligned region-of-interest time series for key articulators (e.g., tongue, lips, mandible). All data are temporally aligned and annotated in a structured way to support multimodal, multi-task modeling. Contribution/Results: Using this dataset, we establish a unified benchmark for two core downstream tasks, articulatory speech synthesis and phoneme recognition, improving comparability, reproducibility, and methodological rigor. This work addresses two gaps in real-time MRI speech research: the scarcity of high-quality, long-duration articulatory data and the lack of standardized evaluation protocols.
📝 Abstract
We release the USC Long Single-Speaker (LSS) dataset, containing real-time MRI video of vocal tract dynamics and simultaneous audio obtained during speech production. This unique dataset contains roughly one hour of video and audio from a single native speaker of American English, making it one of the longer publicly available single-speaker real-time MRI speech datasets. Along with the raw articulatory and acoustic data, we release derived representations suitable for a range of downstream tasks: video cropped to the vocal tract region, sentence-level splits of the data, restored and denoised audio, and region-of-interest time series. We also benchmark this dataset on articulatory synthesis and phoneme recognition tasks, providing baseline performance that future research can aim to improve upon.
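As a concrete illustration of how the synchronized modalities line up, the sketch below maps a sentence-level time span to the corresponding video frames and audio samples. It is a minimal, hypothetical example, not the released data loaders: the 100 fps frame rate is taken from the summary above, while the 16 kHz audio sampling rate and all function and variable names are assumptions.

```python
# Minimal sketch: indexing synchronized modalities for one sentence-level segment.
# Assumptions (not from the release): 100 fps video per the summary, 16 kHz audio,
# and the function/variable names below.
import numpy as np

FPS = 100            # assumed MRI video frame rate (100 fps)
AUDIO_SR = 16_000    # assumed audio sampling rate after restoration/denoising

def segment_indices(start_s: float, end_s: float):
    """Map a sentence-level time span (in seconds) to frame and sample indices."""
    frames = np.arange(int(round(start_s * FPS)), int(round(end_s * FPS)))
    samples = slice(int(round(start_s * AUDIO_SR)), int(round(end_s * AUDIO_SR)))
    return frames, samples

# Example: a hypothetical 2.4 s sentence starting at t = 12.0 s
frames, samples = segment_indices(12.0, 14.4)
print(len(frames), "video frames,", samples.stop - samples.start, "audio samples")
# -> 240 video frames, 38400 audio samples
```

Because the region-of-interest time series are temporally aligned to the video, the same frame indices would select the matching articulator motion signals for that sentence.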