SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

📅 2025-08-29

🤖 AI Summary
To reduce the data and parameter cost of jointly modeling a speech encoder and a large language model (LLM) under low-resource conditions, this paper proposes a parameter-efficient speech–language adapter. The adapter uses lightweight trainable modules to convert speech embeddings into tokens compatible with the LLM, and a single model covers automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). Three supporting techniques are used: LLM-based synthetic annotation of the training data to cut labeling costs, a classifier regularizer, and LoRA-based fine-tuning of the LLM. With 7× fewer trainable parameters, the adapter achieves a 26% relative WER improvement on LibriSpeech and relative F1 gains of 6.3% on NER and 32% on SA. Adding the classifier regularizer and LoRA improves the Spoken Language Understanding Evaluation (SLUE) score by 6.6% and 9.5%.
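As a rough illustration of why LoRA-based fine-tuning keeps the trainable parameter count small, the sketch below applies a low-rank update to a single frozen weight matrix and counts parameters. It is a minimal sketch, not the paper's implementation: the function names, the layer size of 4096, the rank of 8, and the `alpha` scaling value are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A) @ B.

    x: (n, d_in) inputs
    w: (d_in, d_out) frozen pretrained weight
    a: (d_in, rank) and b: (rank, d_out) are the only trainable matrices.
    """
    base = matmul(x, w)                  # frozen path
    low_rank = matmul(matmul(x, a), b)   # trainable low-rank path
    scale = alpha / rank
    return [[p + scale * q for p, q in zip(r1, r2)]
            for r1, r2 in zip(base, low_rank)]


def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full matrix, trainable LoRA parameters)."""
    return d_in * d_out, rank * (d_in + d_out)


# Hypothetical layer size and rank, not taken from the paper.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)  # 16777216 65536 256
```

If `b` is initialized to zeros, a common LoRA convention, the low-rank path contributes nothing at the start of training and the layer reproduces the frozen model's output exactly.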

📝 Abstract
Integrating a speech encoder with an LLM requires substantial data and resources, but many use cases lack sufficient amounts of either. To address this, we propose a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative word error rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields further gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%
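The gains above are reported as relative improvements, which are easy to confuse with absolute percentage-point changes. The sketch below computes WER from a word-level edit distance and then a relative reduction between two WER values. The function names and all numbers in the usage lines are illustrative assumptions, not figures from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which an error metric dropped relative to the baseline."""
    return (baseline - improved) / baseline


# One substitution and one deletion against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333

# Hypothetical WERs: a drop from 10.0% to 7.4% is a 26% relative reduction,
# but only 2.6 percentage points in absolute terms.
print(round(relative_reduction(10.0, 7.4), 2))  # 0.26
```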
Problem

Research questions and friction points this paper is trying to address.

Integrating speech and text models with limited data
Reducing resource needs for multi-task speech understanding
Improving low-resource speech recognition and understanding performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient adapter for speech-to-LLM token conversion
LLM-based synthetic dataset annotation reduces labeling costs
Classifier regularizer and LoRA optimization yield further performance gains
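The adapter's role of turning speech-encoder outputs into LLM-compatible tokens can be pictured with a small sketch. The summary does not describe the adapter's internal architecture, so the design below (stacking `k` consecutive encoder frames to shorten the sequence, then applying one linear projection into the LLM embedding width) is an assumption chosen for illustration, as are all names and dimensions.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping a ragged tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight, bias):
    """Apply y = x W + b row by row; weight has shape (d_in, d_out)."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) + bj
             for col, bj in zip(zip(*weight), bias)]
            for row in x]


def adapt(frames, weight, bias, k):
    """Map encoder frames to a shorter sequence of LLM-width vectors."""
    return linear(stack_frames(frames, k), weight, bias)


# Hypothetical sizes: 7 encoder frames of width 4, LLM embedding width 6.
random.seed(0)
enc_dim, llm_dim, k = 4, 6, 2
frames = [[random.gauss(0, 1) for _ in range(enc_dim)] for _ in range(7)]
weight = [[random.gauss(0, 0.02) for _ in range(llm_dim)]
          for _ in range(k * enc_dim)]
bias = [0.0] * llm_dim

tokens = adapt(frames, weight, bias, k)
print(len(tokens), len(tokens[0]))  # 3 6
```

In this sketch only `weight` and `bias` would be trained, while the speech encoder and the LLM stay frozen, which is what keeps the trainable parameter count low.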