RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation?

📅 2025-06-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the evaluation of AI tutors under resource-constrained settings, investigating the practical performance limits of sub-billion-parameter models. The authors propose a lightweight fine-tuning framework based on small-scale Transformers, combining task-specific prompt engineering with a compact classification head, enabling efficient end-to-end execution on commodity hardware (e.g., single-GPU or CPU-only systems). To their knowledge, this is the first systematic evaluation demonstrating that ultra-lightweight models remain competitive across multiple tutor-assessment tracks: their per-track exact F1 scores trail the top-performing systems by only 6.46 to 13.13 points. The approach offers Global South institutions a reproducible, low-cost, and cost-effective solution, lowering the resource barrier that has hindered the adoption of large models in educational assessment.

📝 Abstract
In this paper, we present the RETUYT-INCO participation at the BEA 2025 shared task. Our participation was characterized by the decision to use relatively small models, with fewer than 1B parameters. This self-imposed restriction tries to represent the conditions of many research labs and institutions in the Global South, where computational power is not easily accessible due to its prohibitive cost. Even under this restrictive self-imposed setting, our models managed to stay competitive with the rest of the teams that participated in the shared task. According to the exact $F_1$ scores published by the organizers, the performance gaps between our models and the winners were as follows: $6.46$ in Track 1; $10.24$ in Track 2; $7.85$ in Track 3; $9.56$ in Track 4; and $13.13$ in Track 5. Considering that the minimum difference with a winning team is $6.46$ points and the maximum difference is $13.13$ according to the exact $F_1$ score, we find that models smaller than 1B parameters are competitive for these tasks, all of which can be run on computers with a low-budget GPU or even without a GPU.
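The exact $F_1$ metric referenced above can be illustrated with a minimal, self-contained sketch in pure Python. This is an assumption-laden example, not the organizers' official scorer: the macro-averaging scheme and the example labels are chosen purely for illustration.

```python
def exact_macro_f1(gold, pred):
    """Macro-averaged F1 over exact label matches.

    Illustrative sketch only: the shared task's official scorer may
    differ (e.g., in label handling or averaging); this just shows
    the general shape of an exact-match F1 computation.
    """
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        # Count exact agreements and disagreements for this label.
        tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
        fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical annotations (label set assumed for illustration).
gold = ["Yes", "No", "Yes", "To some extent"]
pred = ["Yes", "No", "No", "To some extent"]
print(round(exact_macro_f1(gold, pred), 3))  # → 0.778
```

On this toy data the per-label F1 values (0.667 for "Yes", 0.667 for "No", 1.0 for "To some extent") average to 0.778; a gap of 6.46 points on such a scale (reported as percentages) is what separates the lightweight systems from the Track 1 winner.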
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI tutor performance with lightweight models
Assessing model competitiveness under limited computational resources
Comparing small models (<1B params) against larger ones in shared tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used lightweight models under 1B parameters
Competitive performance with low computational cost
Optimized for low-budget GPU or CPU usage
S. Góngora
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Ignacio Sastre
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Santiago Robaina
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Ignacio Remersaro
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Luis Chiruzzo
Universidad de la República
Aiala Rosá
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay