LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak continuous speech prompting, poor ASR error correction, and inflexible module integration in speech encoder–large language model (LLM) co-modeling, this paper proposes LegoSLM, a modular speech–language modeling paradigm. It uses CTC posterior probabilities as a bridge: speech encoder outputs are mapped onto the LLM vocabulary, pseudo-audio embeddings are reconstructed as a posterior-weighted sum of the LLM input embeddings, and these are concatenated with text embeddings to form the LLM input. The paper introduces the first CTC-posterior-based modular speech–language alignment mechanism, enabling zero-shot speech encoder replacement, and a softmax temperature that balances the decode-time contributions of the speech and language models. Evaluated on eight MLS test sets, LegoSLM achieves an average 49% word error rate reduction (WERR) over the USM-CTC baseline, improving both ASR and speech translation. The framework also demonstrates cross-encoder zero-shot transferability and domain adaptability.
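The core bridging step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the toy shapes, and the random inputs are all assumptions; only the mechanism (posterior-weighted sum of the LLM embedding table, then concatenation with text embeddings) comes from the summary.

```python
import numpy as np

def pseudo_audio_embeddings(ctc_posteriors, embedding_table):
    """Reconstruct pseudo-audio embeddings as a CTC-posterior-weighted
    sum of the LLM input embedding table.

    ctc_posteriors: (T, V) posterior matrix over the LLM vocabulary.
    embedding_table: (V, D) LLM input embedding table.
    Returns: (T, D) pseudo-audio embeddings, one per speech frame.
    """
    return ctc_posteriors @ embedding_table

# Toy example: T=3 speech frames, V=4 vocabulary entries, D=2 dims.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
# Row-wise softmax turns logits into CTC-style posteriors.
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
table = rng.normal(size=(4, 2))

audio_emb = pseudo_audio_embeddings(posteriors, table)
text_emb = rng.normal(size=(5, 2))  # e.g. an embedded text prompt
# Concatenate along the sequence axis to form the LLM input.
llm_input = np.concatenate([audio_emb, text_emb], axis=0)
print(llm_input.shape)  # (8, 2)
```

Because the weighted sum stays inside the LLM's own embedding space, any encoder that emits posteriors over the same vocabulary can be plugged in without retraining the LLM, which is what enables the zero-shot encoder swapping reported in the paper.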

📝 Abstract
Recently, large-scale pre-trained speech encoders and Large Language Models (LLMs) have been released, which show state-of-the-art performance on a range of spoken language processing tasks including Automatic Speech Recognition (ASR). To effectively combine both models for better performance, continuous speech prompts and ASR error correction have been adopted. However, these methods are prone to suboptimal performance or are inflexible. In this paper, we propose a new paradigm, LegoSLM, that bridges speech encoders and LLMs using the ASR posterior matrices. The speech encoder is trained to generate Connectionist Temporal Classification (CTC) posteriors over the LLM vocabulary, which are used to reconstruct pseudo-audio embeddings by computing a weighted sum of the LLM input embeddings. These embeddings are concatenated with text embeddings in the LLM input space. Using the well-performing USM and Gemma models as an example, we demonstrate that our proposed LegoSLM method yields good performance on both ASR and speech translation tasks. By connecting USM with Gemma models, we obtain an average of 49% WERR over the USM-CTC baseline on 8 MLS test sets. The trained model also exhibits modularity in a range of settings: after fine-tuning the Gemma model weights, the speech encoder can be switched and combined with the LLM in a zero-shot fashion. Additionally, we propose to control the decode-time influence of the USM and LLM using a softmax temperature, which proves effective for domain adaptation.
Problem

Research questions and friction points this paper is trying to address.

Bridging speech encoders and LLMs for better performance
Improving ASR and speech translation tasks
Enhancing modularity and flexibility in model integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Connects speech encoders and LLMs via CTC posteriors
Reconstructs pseudo-audio embeddings using LLM vocabulary
Controls decode-time influence with softmax temperature
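The temperature knob in the last bullet can be sketched as ordinary temperature-scaled softmax over the CTC logits. This is an illustrative sketch under stated assumptions, not the paper's code: the function name and example values are invented, and the interpretation (higher temperature flattens the posteriors, shifting decode-time influence from the speech encoder toward the LLM) follows the abstract's description.

```python
import numpy as np

def tempered_posteriors(ctc_logits, temperature):
    """Softmax over CTC logits with a temperature parameter.

    temperature > 1 flattens the distribution (weaker speech-encoder
    influence, stronger LLM prior at decode time); temperature < 1
    sharpens it. Names here are illustrative, not from the paper.
    """
    scaled = ctc_logits / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.0]])
sharp = tempered_posteriors(logits, 0.5)  # sharpened
flat = tempered_posteriors(logits, 2.0)   # flattened
# The sharpened distribution puts more mass on the top token.
print(sharp[0, 0] > flat[0, 0])  # True
```

Since the pseudo-audio embeddings are a posterior-weighted sum, flattening the posteriors this way pulls the embeddings toward the vocabulary-wide average, giving the LLM's language prior more room, which is one plausible reading of why the knob helps with domain adaptation.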