MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

📅 2025-05-24
🤖 AI Summary
This work addresses the lack of a unified, multi-dimensional framework for evaluating response quality in AI mathematics tutors. We propose a joint evaluation method spanning four pedagogical dimensions: error identification, error localization, guidance provision, and actionability. Methodologically, we introduce an architecture-agnostic unified instruction-tuning framework enabling end-to-end, single-model optimization across all dimensions. To enhance discrimination of minority-class labels, we design a disagreement-aware ensemble inference strategy. Robustness and scalability are further improved via multi-task joint training and annotation consistency analysis. Evaluated on the BEA 2025 shared task, our approach achieves first place in guidance provision, third in actionability, and fourth in both error identification and error localization, demonstrating its effectiveness and competitive performance across all dimensions.

📝 Abstract
We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI tutor responses across four instructional dimensions
Fine-tuning a single model for multiple tasks without architectural changes
Improving prediction reliability with disagreement-aware ensemble inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified training pipeline for multi-dimensional tuning
Disagreement-aware ensemble inference strategy
Scalable instruction tuning for robust evaluation
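The paper does not spell out the aggregation rule here, but a minimal sketch of a disagreement-aware ensemble vote, assuming the strategy boosts minority-label coverage by preferring a minority-class prediction whenever ensemble members disagree (the function name, label set, and tie-breaking rule below are illustrative assumptions, not the authors' exact method):

```python
from collections import Counter

def disagreement_aware_vote(predictions, minority_labels):
    """Aggregate per-example predictions from ensemble members.

    predictions: labels emitted by each fine-tuned model for one example,
    e.g. ["Yes", "Yes", "To some extent"] for the Providing Guidance track.
    minority_labels: the rare classes to protect (assumed here).

    If the members disagree and any of them predicted a minority label,
    return that minority label; otherwise fall back to majority vote.
    """
    counts = Counter(predictions)
    if len(counts) > 1:  # members disagree
        for label in predictions:
            if label in minority_labels:
                return label
    # unanimous, or disagreement without a minority vote: majority wins
    return counts.most_common(1)[0][0]
```

Under this rule, `disagreement_aware_vote(["Yes", "Yes", "To some extent"], {"To some extent"})` returns `"To some extent"` rather than the majority label, which is one plausible way to raise recall on under-represented classes.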
Baraa Hikal
Faculty of Computer Science, MSA University, Egypt
Mohamed Basem
Student, Faculty of Computer Science, MSA University
Islam Oshallah
Faculty of Computer Science, MSA University, Egypt
Ali Hamdi
Computer Science, MSA University
Computer Vision · Deep Learning · Text Mining