English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the complex joint training that automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) for second-language learners have conventionally required. We propose an end-to-end multimodal large language model (LLM) framework that uses Low-Rank Adaptation (LoRA) for unified, parameter-efficient fine-tuning. To our knowledge, this is the first application of LoRA to a speech-text multimodal LLM, specifically Phi-4-multimodal-instruct, for these tasks; it requires no architectural modifications and no separate training procedure per task. Fine-tuned and evaluated on Speechocean762, our model achieves a Pearson correlation coefficient of 0.71 with human ratings on APA and word and phoneme error rates below 0.15 on MDD, comparable to fine-tuning all audio layers. The approach yields a lightweight, unified framework for multi-task pronunciation assessment with a much simpler training setup than earlier joint APA and MDD models.
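To make "parameter-efficient" concrete, the sketch below illustrates the general LoRA idea on a single linear layer: the pretrained weight matrix stays frozen and only a low-rank update is trained. It is a toy illustration in plain Python, not the paper's training code. The class name `LoRALinear`, the helper `matvec`, and the values chosen for `rank` and `alpha` are assumptions made for this example; the summary does not state which ranks or layers the authors adapted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical toy class for illustration only.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weights, shape (d_out, d_in)
        self.scale = alpha / rank
        # A gets small random values and B starts at zero, so the adapted
        # layer initially produces the same output as the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        # Only A (rank x d_in) and B (d_out x rank) would receive gradients.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_parameters(self):
        return len(self.weight) * len(self.weight[0])


# Assumed toy dimensions: a 64 x 64 layer adapted with rank 4.
rng = random.Random(1)
W = [[rng.gauss(0.0, 0.1) for _ in range(64)] for _ in range(64)]
layer = LoRALinear(W, rank=4, alpha=8.0)
print(layer.trainable_parameters(), "trainable vs", layer.frozen_parameters(), "frozen")
```

With these assumed sizes, the adapter trains 4 * 64 + 64 * 4 = 512 values next to 4096 frozen ones. The gap widens as the layer grows, because the adapter scales with `rank * (d_in + d_out)` while the frozen matrix scales with `d_in * d_out`.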

📝 Abstract
This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the model predicted pronunciation scores that correlated strongly with human-assigned scores (Pearson Correlation Coefficient, PCC > 0.7) and achieved low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to reach performance comparable to fine-tuning all audio layers. This research shows that an integrated pronunciation assessment system can be built by adapting large multimodal models without full fine-tuning, using a significantly simpler training methodology than previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
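The three reported metrics follow standard definitions, sketched below in plain Python. PCC measures linear agreement between predicted and human scores. WER and PER are the Levenshtein edit distance between a reference and a hypothesis token sequence, divided by the reference length, computed over words and phonemes respectively. This is an illustration of those definitions, not the authors' evaluation script; the function names and the sample sentences, phoneme labels, and scores are made up for the example.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalised by reference length (WER or PER)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up examples: word tokens for WER, ARPAbet-style phoneme labels for PER.
wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
per = error_rate(["DH", "AH", "K", "AE", "T"], ["D", "AH", "K", "AE", "T"])
pcc = pearson_correlation([8.0, 6.0, 9.0, 4.0], [7.5, 6.5, 9.0, 5.0])
print(f"WER={wer:.3f} PER={per:.3f} PCC={pcc:.3f}")
```

Because both error rates are normalised by the reference length, many inserted tokens can push them above 1.0. Values below 0.15, as reported in the abstract, mean fewer than 15 edits per 100 reference tokens.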
Problem

Research questions and friction points this paper is trying to address.

Simultaneously perform automatic pronunciation assessment and mispronunciation detection
Eliminate need for complex architectural changes in pronunciation evaluation
Establish integrated assessment system without full model fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LoRA fine-tuned multimodal LLM for pronunciation evaluation
Eliminates complex joint training and architectural changes
Achieves strong correlation with human scores efficiently