🤖 AI Summary
To address the heavy data and resource requirements of integrating speech encoders with large language models (LLMs) end to end under low-resource conditions, this paper proposes a parameter-efficient speech-to-LLM adapter. The adapter uses lightweight trainable modules to convert speech embeddings into LLM-compatible tokens, and targets end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The approach combines three techniques: LLM-based synthetic dataset annotation to reduce labeling costs, a classifier regularizer, and LoRA-based fine-tuning of the LLM. With 7× fewer trainable parameters, the adapter achieves a 26% relative WER improvement on LibriSpeech and relative F1 gains of 6.3% on NER and 32% on SA. Adding the classifier regularizer and LoRA yields SLUE score improvements of 6.6% and 9.5%. These results indicate more efficient cross-modal alignment and better generalization in low-resource settings.
📝 Abstract
Integrating a speech encoder with an LLM requires substantial data and resources, yet many use cases are constrained by the limited availability of both. To address this, we propose a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields further gains, with Spoken Language Understanding Evaluation (SLUE) score improvements of 6.6% and 9.5%.
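The headline ASR figure is a relative improvement, so it depends on the baseline error rate and not only on the absolute difference. The minimal sketch below shows how WER and a relative WER improvement are commonly computed. The function names and the numbers in the usage note are hypothetical illustrations and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, with made-up values, a baseline WER of 5.0% and a new WER of 3.7% give `relative_wer_improvement(5.0, 3.7)` of about 0.26, which would be reported as a 26% relative improvement even though the absolute difference is only 1.3 points.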