SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

📅 2025-08-29

🤖 AI Summary

Jointly modeling a speech encoder and a large language model (LLM) end to end is data- and parameter-hungry, which is a problem under low-resource conditions. This paper proposes a parameter-efficient speech–language adapter: a lightweight trainable module that maps speech embeddings into LLM-compatible tokens. The adapter is combined with three techniques (LLM-based synthetic dataset annotation, a classifier regularizer, and LoRA-based fine-tuning of the LLM) to support automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA) in one model. According to the reported experiments, the adapter uses 7× fewer trainable parameters, achieves a 26% relative WER improvement on LibriSpeech, and improves NER and SA F1 scores by a relative 6.3% and 32%, respectively. Adding the classifier regularizer and LoRA improves the SLUE score by 6.6% and 9.5%. These results point to more efficient cross-modal alignment and better generalization in low-resource settings.
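As a rough illustration of what "mapping speech embeddings into LLM-compatible tokens" can look like, here is a minimal sketch that stacks consecutive encoder frames and applies a single linear projection into the LLM embedding size. This frame-stacking design is an assumption chosen for illustration, not the adapter described in the paper, and all names and dimensions (`stack_frames`, `project`, `enc_dim`, `llm_dim`, `k`) are hypothetical.

```python
import random

def stack_frames(frames, k):
    """Concatenate every k consecutive frames; drop an incomplete tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]

def project(vectors, weight):
    """Linear map: weight has shape (out_dim, in_dim), stored as a list of rows."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weight] for v in vectors]

def adapter(frames, weight, k):
    """Shorten the frame sequence by k, then project into the LLM embedding size."""
    return project(stack_frames(frames, k), weight)

random.seed(0)
enc_dim, llm_dim, k = 4, 6, 2  # hypothetical toy sizes
frames = [[random.random() for _ in range(enc_dim)] for _ in range(5)]
# The projection weight is the only trainable part of this toy adapter.
weight = [[random.gauss(0, 0.1) for _ in range(enc_dim * k)] for _ in range(llm_dim)]

tokens = adapter(frames, weight, k)
print(len(tokens), len(tokens[0]))  # sequence length and embedding size
```

In this sketch the speech encoder and the LLM would stay frozen, so only `weight` counts toward trainable parameters.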

📝 Abstract

Integrating a speech encoder with an LLM requires substantial data and resources, which many use cases cannot provide. To address this, we propose a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields further gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%.
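The gains above are relative, not absolute. A minimal sketch of the arithmetic follows; the baseline and adapter values are hypothetical numbers picked to reproduce a 26% relative WER improvement and are not taken from the paper.

```python
def relative_reduction(baseline, new):
    """Relative improvement for a lower-is-better metric such as WER."""
    return (baseline - new) / baseline

def relative_increase(baseline, new):
    """Relative improvement for a higher-is-better metric such as F1."""
    return (new - baseline) / baseline

# Hypothetical values for illustration only.
baseline_wer, adapter_wer = 10.0, 7.4
wer_gain = relative_reduction(baseline_wer, adapter_wer)
print(f"relative WER improvement: {wer_gain:.0%}")
```

With these made-up numbers the absolute WER drop is 2.6 points, while the relative improvement is 26%.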

Problem

Research questions and friction points this paper is trying to address.

Integrating speech and text models with limited data

Reducing resource needs for multi-task speech understanding

Improving low-resource speech recognition and understanding performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient adapter for speech-to-LLM token conversion

LLM-based synthetic dataset annotation reduces labeling costs

Classifier regularizer and LoRA fine-tuning of the LLM further improve SLUE scores
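To see why LoRA keeps the trainable parameter count small, consider one weight matrix of shape d_out × d_in. LoRA freezes it and trains two low-rank factors, B (d_out × r) and A (r × d_in), whose product is added to the frozen weight. The sketch below only counts parameters; the layer size and rank are hypothetical and do not come from the paper.

```python
def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Hypothetical layer size and rank, chosen only for illustration.
d_in = d_out = 4096
rank = 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(full, lora, full // lora)
```

The ratio here describes a single hypothetical layer and is unrelated to the 7x figure reported for the adapter.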
