InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech large language models (SpeechLLMs) exhibit significantly weaker performance on speech instruction-following tasks compared to text-based inputs, primarily due to semantic misalignment between speech and text representations. To address this, we propose an interleaved speech–text self-supervised pretraining paradigm that requires no manually curated paired data: pseudo speech–text pairs are synthesized via text-to-speech (TTS), and interleaved speech and text sequences are jointly modeled to achieve scalable cross-modal representation alignment. Our key contributions are threefold: (1) the first unsupervised interleaved pretraining framework for speech–text joint modeling; (2) SpeechInstructBench—the first benchmark specifically designed for evaluating speech instruction-following capability; and (3) state-of-the-art performance on SpeechInstructBench, alongside superior or competitive results across diverse speech understanding and generation tasks.
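
The core of the method is the data construction: sample segments from a text corpus, synthesize the speech side with TTS, and interleave the two modalities in a single training sequence. The sketch below illustrates one plausible construction; `tts_synthesize`, `speech_to_tokens`, and the 0.3 speech ratio are illustrative placeholders, not details taken from the paper.

```python
import random

# Hypothetical stand-ins for the pipeline components; the paper does not name
# specific models, so any TTS engine and discrete speech tokenizer could be
# substituted here.
def tts_synthesize(sentence: str) -> bytes:
    return b""  # placeholder: a real TTS model would return a waveform

def speech_to_tokens(waveform: bytes) -> list[int]:
    return [0, 1, 2]  # placeholder: a speech tokenizer would emit discrete units

def build_interleaved_sequence(document: str, speech_ratio: float = 0.3) -> list:
    """Convert a random fraction of sentences to (synthesized) speech tokens
    and interleave them with the remaining text, preserving document order,
    so the model can learn to continue speech segments in text."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    sequence = []
    for sentence in sentences:
        if random.random() < speech_ratio:
            waveform = tts_synthesize(sentence)
            sequence.append(("speech", speech_to_tokens(waveform)))
        else:
            sequence.append(("text", sentence))
    return sequence

print(build_interleaved_sequence(
    "Speech models are improving. Alignment remains hard. TTS makes pseudo pairs cheap."
))
```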

📝 Abstract
Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train on large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed InSerter achieves SOTA performance on SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.
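
Given such interleaved sequences, pre-training can proceed as ordinary next-token prediction, with the model learning to produce the textual continuation of each speech segment. The PyTorch sketch below assumes discrete speech tokens in a vocabulary shared with text, and restricts the loss to text positions; both choices are assumptions for illustration rather than details confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def interleaved_lm_loss(logits: torch.Tensor,
                        tokens: torch.Tensor,
                        modality: torch.Tensor) -> torch.Tensor:
    """Next-token prediction over the interleaved sequence.

    logits:   (batch, seq_len, vocab) model outputs
    tokens:   (batch, seq_len) interleaved speech/text token ids
    modality: (batch, seq_len) position marker, 0 = speech, 1 = text
    The loss is restricted to text targets (an assumption of this sketch),
    so the model is trained to continue speech segments with text.
    """
    # Shift for causal LM: predict token t+1 from positions <= t.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    is_text = modality[:, 1:].reshape(-1).bool()
    return F.cross_entropy(pred[is_text], target[is_text])
```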
Problem

Research questions and friction points this paper is trying to address.

SpeechLLMs follow speech instructions far less reliably than equivalent text input
Semantic inconsistency between speech and text representations degrades performance
Existing alignment methods depend on meticulously curated paired post-training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised interleaved speech-text pre-training
Speech synthesized via TTS from randomly selected text corpus segments
SpeechInstructBench for instruction-following evaluation