🤖 AI Summary
This work addresses a limitation of existing automated L2 spoken language assessment systems: they struggle to deliver multi-granularity scoring and interpretability at the same time. The authors propose a rubric-guided Speech Large Language Model (SpeechLLM) trained with a hybrid strategy that combines supervised fine-tuning and Bounded Direct Preference Optimization (BDPO). The model jointly produces multi-aspect sentence-level scores (accuracy, fluency, prosody), word/phoneme-level accuracy scores, and a natural-language explanation. The authors also evaluate the generated rationales along two axes, plausibility and faithfulness. Experiments on the SpeechOcean762 dataset show that the model matches or exceeds single-granularity models at multi-granularity scoring and produces plausible explanations at the sentence level, though faithfulness degrades at the word/phoneme level.
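The hybrid training strategy can be pictured as a weighted sum of a supervised fine-tuning term and a preference term. The sketch below is only illustrative: it uses the standard DPO loss as a stand-in, because the exact form of the bounded variant (BDPO) is not given in this summary. The function names, the `beta` temperature, and the `pref_weight` mixing coefficient are all assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard (unbounded) DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy (pi_*) and the frozen reference (ref_*).
    NOTE: this is NOT the bounded variant; it is a stand-in.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def hybrid_loss(sft_nll: float,
                pi_chosen: float, pi_rejected: float,
                ref_chosen: float, ref_rejected: float,
                beta: float = 0.1, pref_weight: float = 1.0) -> float:
    """Assumed combination: SFT negative log-likelihood plus a weighted
    preference term. The weighting scheme is hypothetical."""
    pref = dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta)
    return sft_nll + pref_weight * pref
```

When the policy and reference agree on both responses, the preference margin is zero and the preference term equals log 2; it shrinks as the policy favors the chosen response more strongly than the reference does.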
📝 Abstract
Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels for sentence-level aspects (accuracy, fluency, prosody) and for word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.
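To make the idea of mention-based agreement concrete, here is one hypothetical way such a check could be computed at the word level. The abstract does not define the metric, so everything below is an assumption: the tokenization, the score threshold used to flag a word as problematic, and the choice to report precision and recall.

```python
import re


def mentioned_words(rationale: str, words: list[str]) -> set[str]:
    """Return the utterance words that literally appear in the rationale.

    Assumption: a simple lowercase word match counts as a 'mention'.
    """
    tokens = set(re.findall(r"[a-z]+(?:'[a-z]+)?", rationale.lower()))
    return {w.lower() for w in words if w.lower() in tokens}


def mention_agreement(rationale: str, words: list[str],
                      scores: list[int], threshold: int) -> tuple[float, float]:
    """Precision and recall of rationale mentions against flagged words.

    A word is 'flagged' when its label score is below `threshold`
    (a hypothetical cut-off, not taken from the paper).
    """
    flagged = {w.lower() for w, s in zip(words, scores) if s < threshold}
    mentioned = mentioned_words(rationale, words)
    hits = mentioned & flagged
    precision = len(hits) / len(mentioned) if mentioned else 0.0
    recall = len(hits) / len(flagged) if flagged else 0.0
    return precision, recall


# Illustrative data only; not drawn from SpeechOcean762.
words = ["we", "call", "it", "bear"]
scores = [10, 10, 10, 4]
rationale = "The word 'bear' is mispronounced, and 'call' sounds slightly off."
precision, recall = mention_agreement(rationale, words, scores, threshold=8)
```

In this toy case the rationale mentions two words, only one of which is flagged by the labels. A literal match like this also counts incidental uses of common words as mentions, so a real implementation would need a stricter notion of reference. Sparse mentions would show up here as low recall, and weak alignment as low precision.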