MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

📅 2025-05-24
🤖 AI Summary
This work addresses the lack of a unified, multi-dimensional framework for evaluating response quality in AI mathematics tutors. We propose a joint evaluation method spanning four pedagogical dimensions: error identification, error localization, guidance provision, and actionability. Methodologically, we introduce an architecture-agnostic unified instruction-tuning framework enabling end-to-end, single-model optimization across all dimensions. To enhance discrimination of minority-class labels, we design a disagreement-aware ensemble inference strategy. Robustness and scalability are further improved via multi-task joint training and annotation consistency analysis. Evaluated on the BEA 2025 shared task, our approach achieves first place in guidance provision, third in actionability, and fourth in both error identification and error localization, demonstrating its effectiveness and competitive performance across all dimensions.

📝 Abstract
We present MSA-MathEval, our submission to the BEA 2025 Shared Task on evaluating AI tutor responses across four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. Our approach uses a unified training pipeline to fine-tune a single instruction-tuned language model across all tracks, without any task-specific architectural changes. To improve prediction reliability, we introduce a disagreement-aware ensemble inference strategy that enhances coverage of minority labels. Our system achieves strong performance across all tracks, ranking 1st in Providing Guidance, 3rd in Actionability, and 4th in both Mistake Identification and Mistake Location. These results demonstrate the effectiveness of scalable instruction tuning and disagreement-driven modeling for robust, multi-dimensional evaluation of LLMs as educational tutors.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI tutor responses across four instructional dimensions
Fine-tuning a single model for multiple tasks without architectural changes
Improving prediction reliability with disagreement-aware ensemble inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified training pipeline for multi-dimensional tuning
Disagreement-aware ensemble inference strategy
Scalable instruction tuning for robust evaluation
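The paper does not spell out the aggregation rule here, but a minimal sketch of a disagreement-aware ensemble vote, assuming the strategy boosts minority-label coverage by preferring a minority-class prediction whenever ensemble members disagree (the function name, label set, and tie-breaking rule below are illustrative assumptions, not the authors' exact method):

```python
from collections import Counter

def disagreement_aware_vote(predictions, minority_labels):
    """Aggregate per-example predictions from ensemble members.

    predictions: labels emitted by each fine-tuned model for one example,
    e.g. ["Yes", "Yes", "To some extent"] for the Providing Guidance track.
    minority_labels: the rare classes to protect (assumed here).

    If the members disagree and any of them predicted a minority label,
    return that minority label; otherwise fall back to majority vote.
    """
    counts = Counter(predictions)
    if len(counts) > 1:  # members disagree
        for label in predictions:
            if label in minority_labels:
                return label
    # unanimous, or disagreement without a minority vote: majority wins
    return counts.most_common(1)[0][0]
```

Under this rule, `disagreement_aware_vote(["Yes", "Yes", "To some extent"], {"To some extent"})` returns `"To some extent"` rather than the majority label, which is one plausible way to raise recall on under-represented classes.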
Baraa Hikal
Faculty of Computer Science, MSA University, Egypt
Mohamed Basem
Student, Faculty of Computer Science, MSA University
Islam Oshallah
Faculty of Computer Science, MSA University, Egypt
Ali Hamdi
Computer Science, MSA University
Computer Vision · Deep Learning · Text Mining