🤖 AI Summary
This work addresses the dual challenge of modality fusion and performance preservation in speech transcription and translation by extending a text-only multilingual large language model (TOWER) into a unified speech–text model. Methodologically, it treats discretized speech token sequences as an "(N+1)-th language" integrated into the multilingual LLM. Through multi-stage continued pre-training and cross-lingual speech–text alignment modeling, the authors develop the open-source model SPIRE. Experiments show that SPIRE transcribes and translates English speech while fully retaining TOWER's original text-to-text translation capabilities. The core contribution is demonstrating that discretized speech can be folded in as an additional translation language during LLM adaptation, enabling unified modeling of speech understanding and multilingual translation. The code and models are publicly released.
📝 Abstract
Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them very attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showcasing that discretized speech input integration as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
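The "additional language" framing above can be illustrated with a toy sketch: speech feature frames are mapped to discrete unit IDs (here a nearest-centroid quantizer stands in for a learned codebook), consecutive duplicate units are collapsed, and the units are rendered as pseudo-tokens inside a translation-style prompt. All names and formats here (`<su_…>` tokens, the prompt template) are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch of "discretized speech as an extra translation language".
# The quantizer, token format, and prompt template are hypothetical.

def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid
    (a stand-in for a learned speech-discretization codebook)."""
    def nearest(frame):
        return min(
            range(len(centroids)),
            key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, centroids[i])),
        )
    return [nearest(f) for f in frames]

def dedup(units):
    """Collapse runs of identical unit IDs, as is common in
    discrete-unit speech pipelines."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def to_prompt(units, target_lang="English"):
    """Render unit IDs as pseudo-tokens in a translation-style prompt,
    so speech input looks like just another source language."""
    speech_text = " ".join(f"<su_{u}>" for u in units)
    return f"Translate the following speech to {target_lang}:\n{speech_text}\n"

if __name__ == "__main__":
    centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # toy 3-entry codebook
    frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (2.1, 1.9)]
    units = dedup(quantize(frames, centroids))
    print(to_prompt(units))
```

In a real system the pseudo-tokens would be added to the LLM's vocabulary so that continued pre-training can learn their embeddings alongside the existing text tokens.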