Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets word-level forced alignment, which remains inaccurate for low-resource and unseen languages, by proposing a multilingual aligner that combines self-supervised speech representations with a learned dynamic programming decoder. An alignment encoder fuses representations from the MMS model and the UnSupSeg self-supervised phoneme boundary detector to estimate word-boundary probabilities, and the decoder combines these estimates with segmental features to infer the final word boundaries. Trained iteratively on TIMIT and Buckeye, the approach outperforms the Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew) it performs better than or on par with existing aligners, which suggests it could scale to the 1,100+ languages supported by MMS without further training.

📝 Abstract

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming procedure that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to the 1,100+ languages supported by MMS without further training.
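
To make the decoder idea concrete, here is a minimal sketch of a segmental dynamic program that splits an utterance into a known number of words. It is not the authors' implementation. The function name `align_words`, the per-frame `boundary_scores` input, and the `segment_score` callback (a stand-in for learned segmental features and their weights) are all assumptions introduced for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

NEG_INF = float("-inf")


def align_words(
    boundary_scores: Sequence[float],
    num_words: int,
    segment_score: Optional[Callable[[int, int, int], float]] = None,
) -> List[Tuple[int, int]]:
    """Split T frames into `num_words` contiguous segments [start, end).

    boundary_scores[t] is a (hypothetical) encoder score for placing a word
    boundary just before frame t. segment_score(start, end, word_idx) is a
    (hypothetical) segmental term, e.g. a weighted duration feature.
    """
    T = len(boundary_scores)
    if num_words < 1 or num_words > T:
        raise ValueError("num_words must be between 1 and the number of frames")

    def zero_score(start: int, end: int, word_idx: int) -> float:
        return 0.0

    score_fn = segment_score if segment_score is not None else zero_score

    # best[k][t]: best total score for covering frames [0, t) with k words.
    best = [[NEG_INF] * (T + 1) for _ in range(num_words + 1)]
    back = [[-1] * (T + 1) for _ in range(num_words + 1)]
    best[0][0] = 0.0

    for k in range(1, num_words + 1):
        for end in range(k, T + 1):
            # Only interior boundaries are scored; the utterance end is fixed.
            b = boundary_scores[end] if end < T else 0.0
            for start in range(k - 1, end):
                prev = best[k - 1][start]
                if prev == NEG_INF:
                    continue
                cand = prev + score_fn(start, end, k - 1) + b
                if cand > best[k][end]:
                    best[k][end] = cand
                    back[k][end] = start

    # Backtrace from the final frame to recover the word segments.
    segments: List[Tuple[int, int]] = []
    end = T
    for k in range(num_words, 0, -1):
        start = back[k][end]
        segments.append((start, end))
        end = start
    segments.reverse()
    return segments
```

In this sketch the search costs O(K·T²) for K words and T frames. In a learned variant, `segment_score` would be a parameterized function of features computed over each candidate segment, with parameters trained so that reference alignments score higher than competing ones.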

Problem

Research questions and friction points this paper is trying to address.

multilingual

word-level forced alignment

self-supervised representations

dynamic programming

speech processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

forced alignment

self-supervised learning

multilingual speech