From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the dual challenge of modality fusion and performance preservation in speech transcription and translation by extending a text-only multilingual large language model (TOWER) into a unified speech–text model. Methodologically, it treats discretized speech token sequences as an additional, "(N+1)-th" language integrated into the multilingual LLM. Through continued pre-training on speech–text data, the open-source model SPIRE is developed. Experiments show that SPIRE transcribes and translates English speech input while fully retaining TOWER's original text-to-text translation capabilities. The core contribution is demonstrating that discretized speech can be integrated as an additional language during LLM adaptation, enabling unified modeling of speech transcription and multilingual translation. The code and models are publicly released.
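The summary above hinges on speech discretization: continuous feature frames from a speech encoder are mapped to a small vocabulary of discrete "unit" tokens, which the LLM can then consume like words of an extra language. A minimal sketch of this step, assuming nearest-centroid quantization (centroids would normally come from k-means over a pretrained speech encoder's features; here they are random placeholders, and the `<unit_k>` token format is illustrative, not the paper's actual vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)

num_units = 8      # size of the discrete speech "vocabulary" (assumed)
feature_dim = 4    # per-frame speech feature dimensionality (assumed)
# Placeholder centroids; in practice these come from k-means clustering
# of self-supervised speech-encoder features.
centroids = rng.normal(size=(num_units, feature_dim))

def discretize(frames: np.ndarray) -> list:
    """Map each feature frame to its nearest centroid, emitting unit tokens."""
    # (T, 1, D) - (1, K, D) broadcasts to (T, K) pairwise distances
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return [f"<unit_{k}>" for k in dists.argmin(axis=1)]

frames = rng.normal(size=(10, feature_dim))  # stand-in for encoder output
tokens = discretize(frames)
```

The resulting token sequence can be appended to the LLM's vocabulary and trained on exactly like text in any other language.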

📝 Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
Problem

Research questions and friction points this paper is trying to address.

Extending text-only LLMs to process speech input.
Integrating speech as an additional language in multilingual LLMs.
Maintaining translation performance while adding speech transcription.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends LLM to speech via discretization.
Treats speech as additional translation language.
Maintains original performance on translation tasks.
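The "speech as an additional translation language" idea in the points above can be sketched as prompt construction: the discretized speech sequence takes the place a source-language sentence would occupy in a translation prompt. The template and function below are hypothetical illustrations of the framing, not the paper's actual prompt format:

```python
# Hypothetical sketch: slotting discretized speech units into a
# translation-style prompt, mirroring the "(N+1)-th language" framing.
def build_prompt(unit_tokens, target_lang="German"):
    """Format discrete speech units as the source side of a translation prompt."""
    speech = " ".join(unit_tokens)
    return (
        f"Translate the following speech from English to {target_lang}:\n"
        f"{speech}\n"
        f"Translation:"
    )

prompt = build_prompt(["<unit_3>", "<unit_1>", "<unit_7>"])
```

Under this framing, transcription is just "translation" from discretized English speech into English text, so no new task-specific architecture is needed.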