🤖 AI Summary
Fine-tuning large speech models for dialect identification faces significant bottlenecks, including high computational cost, excessive GPU memory consumption, and slow training, while existing parameter-efficient fine-tuning (PEFT) methods remain suboptimal in memory and speed efficiency. This paper applies Memory-Efficient Fine-Tuning (MEFT) methods to the speech domain for the first time, building on the Whisper architecture for six-category Mandarin sub-dialect classification on the KeSpeech dataset. The approach substantially reduces GPU memory usage, by up to 73.25%, and accelerates training by 2.1×, while maintaining recognition accuracy comparable to full fine-tuning. By jointly optimizing memory footprint and throughput without sacrificing performance, this work establishes an efficient, practical, and scalable fine-tuning paradigm for multi-dialect speech recognition under resource-constrained conditions.
📝 Abstract
Dialect Identification (DI) is the task of recognizing different dialects of the same language from a speech signal. DI can help improve downstream speech-related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computational cost and memory requirements. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to a general-purpose pre-trained speech model. We then comprehensively analyze GPU memory usage and fine-tuning speed across various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin sub-dialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.
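To make the setup concrete, here is a minimal, hypothetical sketch of MEFT-style fine-tuning for six-way sub-dialect classification. It is not the paper's implementation: the Whisper encoder is replaced by a tiny stand-in transformer, and the MEFT technique shown is activation checkpointing (recomputing activations in the backward pass instead of storing them), one common way to trade compute for GPU memory. All class and variable names are illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyEncoderBlock(nn.Module):
    """Stand-in for one Whisper-style encoder block (illustrative only)."""

    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))


class DialectClassifier(nn.Module):
    """Encoder blocks plus a linear head for 6-way sub-dialect classification."""

    def __init__(self, dim=64, depth=4, num_dialects=6):
        super().__init__()
        self.blocks = nn.ModuleList(TinyEncoderBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_dialects)

    def forward(self, x):
        for blk in self.blocks:
            # Memory-efficient step: do not cache this block's activations;
            # recompute them during backward, shrinking the memory footprint.
            x = checkpoint(blk, x, use_reentrant=False)
        return self.head(x.mean(dim=1))  # pool over time frames, then classify


model = DialectClassifier()
feats = torch.randn(2, 50, 64, requires_grad=True)  # (batch, frames, feat_dim)
logits = model(feats)
logits.sum().backward()  # gradients flow through the checkpointed blocks
print(logits.shape)  # torch.Size([2, 6])
```

In a real run, `feats` would be log-mel features of KeSpeech utterances and the blocks would be the pre-trained Whisper encoder; the checkpointing call is what distinguishes this from vanilla fine-tuning of the same parameters.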