ZIPA: A family of efficient models for multilingual phone recognition

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-lingual phone recognition models struggle to capture sociophonetic diversity and are computationally inefficient. To address these limitations, the authors propose ZIPA, a family of efficient models trained on IPAPack++, a 17,132-hour multilingual speech corpus with normalized phone transcriptions, accompanied by a novel evaluation set covering unseen languages and sociophonetic variation. ZIPA comprises two lightweight variants built on the compact Zipformer backbone: ZIPA-T, a transducer, and ZIPA-CR, a CTC-based model. Training is scaled further via noisy student training on 11,000 hours of pseudo-labeled multilingual data. Experiments show that ZIPA outperforms existing phone recognition systems with substantially fewer parameters, but error analysis reveals that robust modeling of sociophonetic diversity remains an open challenge.
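The two ZIPA variants differ mainly in decoding: the transducer predicts phones autoregressively, while the CTC model emits frame-level symbols that are collapsed into a phone sequence. As a rough illustration of the latter (a toy sketch, not the paper's code; the phone inventory and logits below are invented), standard CTC best-path decoding looks like:

```python
# Toy sketch of CTC best-path (greedy) decoding, the scheme used by
# CTC-based phone recognizers such as ZIPA-CR. The phone inventory and
# frame logits are invented for illustration; index 0 is the CTC blank.
PHONES = ["<blank>", "a", "i", "p", "t"]

def ctc_greedy_decode(frame_logits):
    """Argmax each frame, collapse repeated symbols, then drop blanks."""
    ids = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:  # skip repeats and the blank symbol
            out.append(PHONES[i])
        prev = i
    return out

# Six frames whose per-frame argmaxes spell "p p a <blank> t t".
frames = [
    [0, 0, 0, 9, 0],
    [0, 0, 0, 9, 0],
    [0, 9, 0, 0, 0],
    [9, 0, 0, 0, 0],
    [0, 0, 0, 0, 9],
    [0, 0, 0, 0, 9],
]
print(ctc_greedy_decode(frames))  # → ['p', 'a', 't']
```

The collapse-then-drop-blank rule is what lets a CTC model output one phone even when it spans several acoustic frames.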

📝 Abstract
We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. With this large-scale training data, ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages the efficient Zipformer backbone and outperforms existing phone recognition systems with far fewer parameters. Further scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields additional improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
Problem

Research questions and friction points this paper is trying to address.

Advance multilingual phone recognition performance efficiently
Address sociophonetic diversity modeling limitations
Scale training with pseudo-labeled multilingual data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Zipformer backbone for multilingual recognition
Large-scale IPAPack++ corpus with normalized transcriptions
Noisy student training with pseudo-labeled multilingual data
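The noisy student step listed above amounts to having a trained teacher transcribe unlabeled speech and keeping only confident outputs as pseudo-labels for the student. A minimal sketch of that filtering stage (the `teacher_predict` stand-in and the 0.9 confidence threshold are illustrative assumptions, not details from the paper):

```python
# Hedged sketch of pseudo-label filtering for noisy student training.
# A real pipeline would run a trained teacher model over unlabeled audio;
# here a stand-in returns canned (phone string, confidence) pairs.

def teacher_predict(utterance):
    """Stand-in teacher: maps an utterance id to (phones, confidence)."""
    fake = {"utt1": ("p a t", 0.97), "utt2": ("i p a", 0.55)}
    return fake[utterance]

def build_pseudo_labels(unlabeled, threshold=0.9):
    """Keep only confident teacher transcriptions as training targets."""
    kept = {}
    for utt in unlabeled:
        phones, conf = teacher_predict(utt)
        if conf >= threshold:
            kept[utt] = phones
    return kept

pseudo = build_pseudo_labels(["utt1", "utt2"])
print(pseudo)  # → {'utt1': 'p a t'}
```

The student is then trained on the union of gold and pseudo-labeled data, typically with input noise such as SpecAugment, which is what makes the student "noisy" relative to the teacher.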