EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts

📅 2024-08-22

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

To address the scalability bottleneck in CEFR B2 oral assessment for e-learning—traditionally reliant on labor-intensive human evaluation—this paper proposes an automated scoring method based on dialogue transcript text. We construct expert-validated, CEFR-aligned synthetic datasets and instruction-tuning corpora, then efficiently fine-tune Mistral Instruct 7B v0.2 using LoRA. We introduce the multi-task EvalYaks model family, the first to jointly perform four-dimensional independent scoring, CEFR-level identification (for both lexical and textual features), and CEFR-aligned generation. Innovations include an India-context-adapted evaluation framework and a high-fidelity synthetic annotation paradigm. Experiments demonstrate an average accuracy of 96%, a mean CEFR-level deviation of only 0.35 levels, and performance three times that of the second-best model—validating the high-precision automation potential of 7B-scale LLMs in professional language assessment.

Technology Category

Application Category

📝 Abstract

Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open source and commercial Large Language Models (LLMs) to score a candidate's performance across various criteria in the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts that are rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family are for assessing the four sections of the CEFR B2 speaking exam, one for identifying the CEFR level of vocabulary and generating level-specific vocabulary, and another for detecting the CEFR level of text and generating level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed 3 times better than the next best model. This demonstrates that a 7B parameter LLM instruction tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.

Problem

Research questions and friction points this paper is trying to address.

Automate CEFR B2 English speaking assessment scoring

Create expert-validated synthetic conversational datasets

Develop instruction-tuned models for scalable evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs for CEFR B2 speaking scoring

Creates expert-validated synthetic conversational dataset

Tunes Mistral 7B with CEFR-aligned datasets

🔎 Similar Papers

No similar papers found.

Scale AI

$264,800—$331,000 USD

San Francisco / New York / Seattle

Research Scientist Intern, Multimodal AI (PhD)