LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

📅 2024-12-19
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual ASR has long suffered from imbalanced performance across languages. This paper proposes LAMA-UT, a language-agnostic multilingual ASR pipeline that requires no language-specific modules. It operates in two steps: (1) a universal transcription generator unifies orthographic features into a Romanized form, capturing phonetic characteristics shared across diverse languages; (2) a universal converter transliterates these universal transcriptions into the target language's orthography. Despite being trained on only 0.1% of Whisper's training data, the pipeline achieves a 45% relative error-rate reduction over Whisper, performs comparably to MMS and to zero-shot approaches that rely on additional language-specific lexicons and language models, and generalizes to unseen languages, simplifying deployment.

📝 Abstract
Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, although our pipeline does not rely on any language-specific modules, it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
Problem

Research questions and friction points this paper is trying to address.

Building universal multilingual ASR models with equitable performance across languages
Overcoming challenges of language-specific modules in multilingual speech recognition
Achieving high accuracy with minimal training data compared to existing models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies orthographies into Romanized phonetic form
Converts universal transcriptions to language-specific ones
Achieves multilingual ASR without language-specific modules
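The two-step idea above can be illustrated with a toy sketch. This is not the authors' code: LAMA-UT learns both steps from data, whereas the hand-made Greek-to-Roman character map below is purely hypothetical and only shows how a shared Romanized representation can sit between speech recognition and language-specific output.

```python
# Toy illustration of the LAMA-UT pipeline shape (not the paper's method):
# step 1 unifies language-specific orthography into a Romanized universal
# transcription; step 2 transliterates it back into the target orthography.

# Hand-made, illustrative Greek <-> Roman character mapping.
GREEK_TO_ROMAN = {
    "κ": "k", "α": "a", "λ": "l", "η": "i", "μ": "m", "ε": "e", "ρ": "r",
}
ROMAN_TO_GREEK = {v: k for k, v in GREEK_TO_ROMAN.items()}


def unify(text: str) -> str:
    """Step 1: map orthography into a Romanized universal form."""
    return "".join(GREEK_TO_ROMAN.get(ch, ch) for ch in text)


def transliterate(text: str) -> str:
    """Step 2: convert the universal transcription to Greek orthography."""
    return "".join(ROMAN_TO_GREEK.get(ch, ch) for ch in text)


universal = unify("καλημερα")    # pretend this came from the ASR front end
print(universal)                 # -> kalimera
print(transliterate(universal))  # -> καλημερα
```

Because the universal form is language-agnostic, the recognizer in step 1 never needs to know which language it is transcribing; only the lightweight converter in step 2 is target-language aware.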
Sangmin Lee
Dept. of Electrical & Electronic Engineering, Yonsei University, South Korea
Woo-Jin Chung
Dept. of Electrical & Electronic Engineering, Yonsei University, South Korea
Hong-Goo Kang
Yonsei University