Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data

📅 2025-09-18
🤖 AI Summary
Fine-tuning large audio-language models (LALMs) for spoken language understanding (SLU) remains challenging under speech-data scarcity. Method: We propose a text-guided progressive speech-text collaborative fine-tuning framework, integrating multilingual text pretraining, text-only fine-tuning, speech-text mixed training, and curriculum learning—achieving effective adaptation using only 2–5% of target-language speech data. Contribution/Results: This work is the first to systematically validate the efficacy of text-only fine-tuning for LALMs on SLU tasks; reveals that synergistic use of minimal speech data and high-quality text significantly boosts low-resource performance; and demonstrates that curriculum learning yields optimal gains under extreme data scarcity (<1% speech). Experiments show substantial improvements over strong baselines across multiple SLU benchmarks, establishing a scalable, cost-efficient paradigm for cross-lingual, low-resource speech understanding.

📝 Abstract
Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes, including text-only fine-tuning, direct mixing, and curriculum learning, affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into LALM fine-tuning under realistic data constraints.
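The curriculum scheme described above, starting from text-only fine-tuning and gradually mixing in a small share of paired speech-label data, can be sketched as a per-epoch data-mixing schedule. This is an illustrative sketch only: the function name, the linear ramp, and the target speech ratio are assumptions, not the paper's actual implementation.

```python
import random

def curriculum_mix(text_pool, speech_pool, epochs, speech_ratio_max=0.05, seed=0):
    """Build a per-epoch training schedule that ramps in speech data.

    Hypothetical sketch: epoch 0 is text-only, and the share of paired
    speech-label examples grows linearly up to `speech_ratio_max`
    (e.g. 5% of the text pool), loosely mirroring the curriculum idea
    in the abstract. Returns a list of (epoch, ratio, examples) tuples.
    """
    rng = random.Random(seed)
    schedule = []
    for epoch in range(epochs):
        # Linear ramp: no speech at epoch 0, maximum share at the final epoch.
        ratio = speech_ratio_max * epoch / max(epochs - 1, 1)
        n_speech = int(round(ratio * len(text_pool)))
        examples = rng.sample(speech_pool, min(n_speech, len(speech_pool)))
        examples += list(text_pool)
        rng.shuffle(examples)  # interleave modalities within the epoch
        schedule.append((epoch, ratio, examples))
    return schedule
```

A more realistic setup would wrap this schedule around a sampler for an actual fine-tuning loop; the point here is only the gradual text-to-speech mixing curve.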
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning Large Audio Language Models with limited speech data
Exploring effective fine-tuning schemes for spoken language understanding
Addressing cross-lingual adaptation with minimal target-language speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only fine-tuning for generalization
Adding minimal speech data for gains
Curriculum learning under scarce data
Youngwon Choi
MAUM AI Inc.
Conversational AI
Jaeyoon Jung
MAUM AI Inc., Soongsil University, Republic of Korea
Multimodal / Embodied AI
Hyeonyu Kim
MAUM AI Inc., Republic of Korea
Huu-Kim Nguyen
MAUM AI Inc., Republic of Korea
Hwayeon Kim
MAUM AI Inc., Republic of Korea