🤖 AI Summary
This study addresses the complex joint training conventionally required to perform automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) together for second-language learners. We propose an end-to-end multimodal large language model (MLLM) framework that uses Low-Rank Adaptation (LoRA) for unified, parameter-efficient fine-tuning. To our knowledge, this is the first application of LoRA to a speech-text multimodal LLM (specifically Phi-4-multimodal-instruct) for these tasks, and it requires neither architectural modifications nor separate per-task training. Evaluated on Speechocean762, our model achieves a Pearson correlation coefficient of 0.71 with human ratings on APA and word and phoneme error rates below 0.15 on MDD, comparable to the performance obtained by fine-tuning all audio layers. The approach offers a lightweight, unified multi-task pronunciation assessment framework that is considerably simpler to train than earlier joint APA and MDD models.
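To make the parameter-efficiency argument concrete, the following is a minimal sketch of the generic LoRA update on a single linear layer, not the authors' training code. The layer sizes, rank `r`, and scaling `alpha` are toy values chosen for illustration (the summary does not state the settings used with Phi-4-multimodal-instruct), and the helper names `matvec` and `lora_forward` are hypothetical.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only the low-rank
    factors A (r x d_in) and B (d_out x r) would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy dimensions (assumptions, not values from the paper).
d_out, d_in, r, alpha = 6, 8, 2, 4
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                               # trainable, zero-initialised
x = [rng.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha, r)

# Trainable parameters: full fine-tuning of this layer vs. LoRA factors only.
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only moves the `r * (d_in + d_out)` adapter weights. For realistic layer widths and a small rank, that count is a small fraction of `d_out * d_in`.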
📝 Abstract
This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for the complex architectural changes or separate training procedures conventionally required for these distinct tasks. After fine-tuning on the Speechocean762 dataset, the model predicted pronunciation scores that correlated strongly with human-assigned scores (Pearson Correlation Coefficient, PCC > 0.7), and it achieved low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to reach performance comparable to that of fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, using a significantly simpler training methodology than previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
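The three reported metrics can be illustrated with a short sketch. This is a generic implementation of Pearson correlation and edit-distance-based error rate under standard definitions, not the evaluation script used in the study; the scores, words, and phoneme sequences below are invented sample data, and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalised by reference length (WER or PER, by token type)."""
    return edit_distance(ref, hyp) / len(ref)

# Invented sample data, for illustration only.
human_scores = [8, 6, 9, 4, 7]
model_scores = [7, 6, 9, 5, 8]
pcc = pearson(human_scores, model_scores)

wer = error_rate("we call it bear".split(), "we call it beer".split())
per = error_rate(["B", "EH", "R"], ["B", "IY", "R"])
print(round(pcc, 2), wer, round(per, 2))
```

The same `error_rate` function yields WER when the tokens are words and PER when they are phonemes, which is why one edit-distance routine covers both MDD metrics.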