Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing automated scoring systems struggle to emulate how instructors grade C++ programming assignments against a rubric. To close this gap, the authors propose a calibration-aware multitask learning framework built on a BART encoder-decoder. The approach adds rubric context and boundary-based soft labels, and introduces a distribution-matching loss term that aligns the predicted grade distribution with the empirical one. With LoRA-based parameter-efficient fine-tuning, the model jointly predicts numeric scores and letter-grade buckets. In the reported experiments, this configuration achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. A fully fine-tuned T5 variant further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity.
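The joint objective described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact loss, so the absolute-error regression term, the KL-divergence form of the distribution-matching term, the weights `w_cls` and `w_dist`, and the function name `multitask_loss` are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(pred_scores, true_scores, bucket_logits, bucket_targets,
                   w_cls=1.0, w_dist=0.5, eps=1e-9):
    """Combine regression, classification, and distribution-matching terms.

    ASSUMPTIONS (not from the paper): absolute error for regression,
    cross-entropy against possibly soft bucket targets, and KL divergence
    between the batch-averaged target and predicted bucket distributions.
    """
    n = len(pred_scores)
    k = len(bucket_targets[0])

    # Regression term: mean absolute error on numeric grades.
    reg = sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / n

    # Classification term: cross-entropy with (soft) bucket targets.
    probs = [softmax(row) for row in bucket_logits]
    cls = -sum(
        t * math.log(p + eps)
        for row_p, row_t in zip(probs, bucket_targets)
        for p, t in zip(row_p, row_t)
    ) / n

    # Distribution-matching term: KL(empirical || predicted) over the batch.
    pred_dist = [sum(row[j] for row in probs) / n for j in range(k)]
    true_dist = [sum(row[j] for row in bucket_targets) / n for j in range(k)]
    dist = sum(
        q * math.log((q + eps) / (p + eps))
        for q, p in zip(true_dist, pred_dist)
    )

    total = reg + w_cls * cls + w_dist * dist
    return total, {"reg": reg, "cls": cls, "dist": dist}
```

The distribution term is computed on batch averages rather than per example, so it penalizes a model that is accurate on average but collapses its predictions into the majority grade bucket.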

📝 Abstract

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.
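To make the contrast between hard one-hot and boundary-based soft labels concrete, here is a minimal sketch of one way such labels could be built. The abstract does not define the scheme, so everything specific here is an assumption: the letter-grade cutoffs in `BOUNDS`, the 3-point `margin`, the linear sharing rule, and the function names are hypothetical.

```python
# ASSUMED cutoffs between letter-grade buckets F/D/C/B/A (not from the paper).
BOUNDS = [60.0, 70.0, 80.0, 90.0]
LABELS = ["F", "D", "C", "B", "A"]


def bucket_index(score):
    """Index of the bucket a numeric score falls into (hard label)."""
    return sum(score >= b for b in BOUNDS)


def boundary_soft_label(score, margin=3.0):
    """Soft target that shares probability mass near a grade boundary.

    Scores at least `margin` points from every cutoff get a one-hot target.
    Closer scores give part of their mass to the neighboring bucket:
    0.5 exactly at the cutoff, falling linearly to 0 at `margin`.
    Assumes `margin` is below half the bucket width, so at most one
    cutoff is ever in range.
    """
    target = [0.0] * len(LABELS)
    idx = bucket_index(score)
    target[idx] = 1.0
    for j, bound in enumerate(BOUNDS):
        d = score - bound
        if abs(d) < margin:
            share = 0.5 * (1.0 - abs(d) / margin)
            # Bucket j lies below this cutoff, bucket j + 1 above it.
            neighbor = j if d >= 0 else j + 1
            target[idx] -= share
            target[neighbor] += share
    return target
```

Under these assumptions a score of 85 keeps a one-hot "B" target, while a score of 89 places most of its mass on "B" and the remainder on "A", which reflects that an instructor could plausibly assign either grade.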

Problem

Research questions and friction points this paper is trying to address.

automated grading

rubric-based assessment

CS1 programming assignments

grade prediction

instructor-like grading

Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric-aware grading

multitask fine-tuning

distribution matching