Dialect Identification Using Resource-Efficient Fine-Tuning Approaches

📅 2025-10-22
🏛️ Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning large speech models for dialect identification faces significant bottlenecks, including high computational cost, excessive GPU memory consumption, and slow training, while existing parameter-efficient fine-tuning (PEFT) methods remain suboptimal in memory and speed efficiency. This paper applies Memory-Efficient Fine-Tuning (MEFT) methods, originally developed for natural language processing, to the speech domain for the first time, building on the Whisper architecture for six-category Mandarin sub-dialect classification on the KeSpeech dataset. The approach substantially reduces GPU memory usage, by up to 73.25%, and accelerates training by 2.1x, while maintaining accuracy comparable to full fine-tuning. By jointly optimizing memory footprint and throughput without sacrificing performance, this work establishes an efficient, practical, and scalable fine-tuning paradigm for multi-dialect speech recognition under resource-constrained conditions.
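Neither the summary nor the abstract pins down which MEFT variants the authors used, so the following is a minimal, hypothetical sketch of the underlying memory argument, not the paper's method. All specifics here are assumptions: PyTorch with Hugging Face transformers, the openai/whisper-base checkpoint, mean pooling over encoder frames, and the class name DialectClassifier. The point it illustrates is that when the backbone runs without autograd, its activations are never stored for backpropagation, which is the dominant memory cost that MEFT-style methods attack.

```python
# Hypothetical sketch (not the paper's released code): a 6-way Mandarin
# sub-dialect classifier on a frozen Whisper encoder. Running the frozen
# backbone under torch.no_grad() means its activations are never kept
# for backpropagation; only the small linear head is trained.
import torch
import torch.nn as nn
from transformers import WhisperModel  # assumes Hugging Face transformers


class DialectClassifier(nn.Module):
    def __init__(self, checkpoint: str = "openai/whisper-base", num_dialects: int = 6):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.encoder.requires_grad_(False)  # freeze every backbone weight
        self.head = nn.Linear(self.encoder.config.d_model, num_dialects)

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: (batch, 80, 3000) log-Mel features from
        # WhisperFeatureExtractor; no_grad() discards encoder activations.
        with torch.no_grad():
            hidden = self.encoder(input_features).last_hidden_state
        return self.head(hidden.mean(dim=1))  # mean-pool over time frames


model = DialectClassifier()
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
```

A frozen-probe setup like this is only one end of the MEFT design space; reversible-layer and activation-recomputation variants keep the backbone trainable while still avoiding full activation storage.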

📝 Abstract
Dialect Identification (DI) is the task of recognizing different dialects within the same language from a speech signal. DI can help to improve downstream speech-related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computation cost and memory requirements. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to a general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed of various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin sub-dialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.
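As a usage sketch for the pipeline the abstract describes (the 16 kHz mono input, the random stand-in waveform, and the openai/whisper-base feature extractor are assumptions, not details from the paper), a KeSpeech-style utterance would be converted to log-Mel features and scored by the classifier sketched above:

```python
# Hypothetical inference sketch: waveform -> log-Mel features -> dialect logits.
import torch
from transformers import WhisperFeatureExtractor

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
waveform = torch.randn(16000 * 5)  # stand-in for a 5-second, 16 kHz utterance
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_features)  # (1, 6) scores over the six sub-dialects
```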
Problem

Research questions and friction points this paper is trying to address.

Optimize dialect identification with memory-efficient fine-tuning
Reduce GPU memory usage and accelerate training speed
Maintain accuracy while improving computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-Efficient Fine-Tuning reduces GPU memory usage (a measurement sketch follows this list)
MEFT methods significantly accelerate training speed
Applied to the Whisper model for Mandarin sub-dialect identification
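The paper reports up to 73.25% lower GPU memory use, but the measurement protocol is not reproduced here, so the snippet below is only one plausible way to compare peak allocation for a single training step. The function name and the dummy sum loss are illustrative, not from the paper; the CUDA memory-statistics calls are standard PyTorch APIs.

```python
# Hypothetical harness: peak CUDA memory for one forward/backward step.
import torch


def peak_memory_mb(model: torch.nn.Module, features: torch.Tensor) -> float:
    torch.cuda.reset_peak_memory_stats()
    model.cuda().train()
    logits = model(features.cuda())
    logits.sum().backward()  # dummy loss, just to trigger backpropagation
    return torch.cuda.max_memory_allocated() / 2**20
```

Running this once with the backbone trainable and once with it frozen would expose the activation-memory gap that MEFT-style methods exploit.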
Zirui Lin
Dept. of Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Tokyo, Japan
Haris Gulzar
NTT Software Innovation Center, Tokyo, Japan
Monnika Roslianna Busto
NTT Software Innovation Center, Tokyo, Japan
Akiko Masaki
NTT Software Innovation Center, Tokyo, Japan
Takeharu Eda
NTT Software Innovation Center
Computer vision, surveillance, databases, Web search
Kazuhiro Nakadai
Institute of Science Tokyo
Robot Audition and Scene Analysis, Artificial Intelligence, Signal and Speech Processing, Robotics