A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in automated L2 spoken language assessment: existing systems struggle to deliver multi-granular scoring and interpretability at the same time. The authors propose a rubric-guided Speech Large Language Model (SpeechLLM) trained with a hybrid objective that combines supervised fine-tuning and Bounded Direct Preference Optimization (BDPO). The model jointly produces sentence-level scores (accuracy, fluency, prosody), word/phoneme-level accuracy scores, and a natural-language rationale. Generated rationales are evaluated along two axes, plausibility and faithfulness. On the SpeechOcean762 dataset, the model matches or outperforms single-granularity models; its rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level.

📝 Abstract

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence level (accuracy, fluency, prosody) and accuracy at the word/phoneme level, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.
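
The hybrid objective mentioned in the abstract can be illustrated with a rough sketch. The abstract does not give the exact form of Bounded Direct Preference Optimization, so the code below substitutes the standard DPO loss as a stand-in and adds it to a supervised fine-tuning term. The function names, the `beta` temperature, and the `pref_weight` mixing factor are assumptions made for illustration, not values or names from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sft_loss(target_token_logps: list[float]) -> float:
    """Supervised fine-tuning term: mean negative log-likelihood of the
    reference response tokens (scores plus rationale)."""
    return -sum(target_token_logps) / len(target_token_logps)


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss on sequence log-probabilities.

    NOTE: this is plain DPO, used as a stand-in. The bounded variant
    used in the paper is not specified in the abstract.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def hybrid_loss(target_token_logps: list[float],
                policy_chosen: float, policy_rejected: float,
                ref_chosen: float, ref_rejected: float,
                beta: float = 0.1, pref_weight: float = 1.0) -> float:
    """Hypothetical combination: SFT term plus a weighted preference term."""
    preference = dpo_loss(policy_chosen, policy_rejected,
                          ref_chosen, ref_rejected, beta)
    return sft_loss(target_token_logps) + pref_weight * preference
```

When the policy and the reference model assign identical log-probabilities to both responses, the preference term equals log 2, and it shrinks as the policy favors the chosen response more strongly than the reference does.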

Problem

Research questions and friction points this paper is trying to address.

L2 speech assessment

interpretability

multi-granular evaluation

natural-language rationales

automated scoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

SpeechLLM

multi-granular assessment

natural-language rationales