🤖 AI Summary
Multilingual ASR has long suffered from imbalanced performance across languages. This paper proposes a language-agnostic, end-to-end universal multilingual ASR framework that requires no language-specific modules, built on a two-stage orthographic unification: (1) transcribing multilingual speech into a Romanized Universal Grapheme Phoneme (UGP) representation, then (2) reversibly transliterating that representation into the target language's orthography. The method integrates UGP generation, language-agnostic phoneme modeling, and unsupervised orthographic alignment. Trained on only 0.1% of Whisper's data, it achieves a 45% relative word error rate reduction over Whisper on major benchmarks, performs comparably to MMS and to dictionary-augmented zero-shot methods, and enables zero-shot recognition of unseen languages, all without language-specific components. This substantially improves generalizability and simplifies deployment.
📝 Abstract
Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has remained a long-standing challenge due to its inherent difficulty.
To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT).
LAMA-UT operates without any language-specific modules, yet it matches the performance of state-of-the-art models despite being trained on only a minimal amount of data.
Our pipeline consists of two key steps.
First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages.
Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones.
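The two steps above can be sketched as a minimal pipeline. This is an illustrative toy, not the authors' implementation: the real system uses a trained universal transcription generator and a trained converter, which are stood in for here by hypothetical table lookups (the utterance ID, the Romanized string, and the mapping tables are all invented for illustration).

```python
# Toy sketch of the two-step LAMA-UT pipeline (illustrative only).
# Real components are neural models; here they are hypothetical lookups.

def universal_transcription_generator(speech_id: str) -> str:
    """Step 1: map speech to a Romanized universal transcription.

    Stand-in for the trained generator; keys and outputs are hypothetical.
    """
    toy_outputs = {"utt_ko_01": "annyeonghaseyo"}
    return toy_outputs[speech_id]

def universal_converter(romanized: str, target_lang: str) -> str:
    """Step 2: transliterate the Romanized transcription into the
    target language's orthography.

    Stand-in for the trained converter; the table is hypothetical.
    """
    toy_table = {("annyeonghaseyo", "ko"): "안녕하세요"}
    return toy_table[(romanized, target_lang)]

def lama_ut_pipeline(speech_id: str, target_lang: str) -> str:
    """Chain the two steps: speech -> Romanized form -> target orthography."""
    romanized = universal_transcription_generator(speech_id)
    return universal_converter(romanized, target_lang)
```

For example, `lama_ut_pipeline("utt_ko_01", "ko")` first produces the Romanized form `"annyeonghaseyo"` and then converts it to Korean orthography. The point of the design is that step 1 is shared across all languages, while step 2 is the only place language identity enters.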
In experiments, we demonstrate the effectiveness of leveraging universal transcriptions for massively multilingual ASR.
Our pipeline achieves a relative error rate reduction of 45% compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data.
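For clarity on how a figure like "45% relative error reduction" is read, the standard formula is the baseline WER minus the new WER, divided by the baseline WER. The concrete WER values below are hypothetical and chosen only to illustrate the arithmetic, not taken from the paper.

```python
def relative_wer_reduction(wer_baseline: float, wer_ours: float) -> float:
    """Relative word error rate reduction of a system vs. a baseline.

    E.g. a baseline WER of 40.0 improved to 22.0 is a 0.45 (45%) reduction.
    """
    return (wer_baseline - wer_ours) / wer_baseline

# Hypothetical numbers: 40.0 -> 22.0 corresponds to a 45% relative reduction.
print(relative_wer_reduction(40.0, 22.0))  # 0.45
```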
Furthermore, although our pipeline relies on no language-specific modules, it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models.
We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.