🤖 AI Summary
This study addresses two longstanding problems in German legal education, grader shortages and delayed feedback, by presenting a systematic evaluation of 27 open- and closed-source large language models (LLMs) for automated scoring of exam answers in criminal and public law. The authors benchmark prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric, combine models through ensembling, and measure agreement with expert grades using quadratic weighted kappa (QWK). With a sample solution and a rubric, reasoning-oriented LLMs reach a QWK of up to 0.91 in public law, compared with 0.60 in criminal law, which suggests that criminal law is the harder grading task. Ensembling improves agreement by up to 0.15 over the best ensemble member and can offer an alternative to stronger closed-source single models. The findings suggest that prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
📝 Abstract
Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, which delays feedback and creates a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, the literature lacks systematic studies of effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Measured by quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting that criminal law is the harder grading task. Beyond single-model grading, ensembling improves agreement by up to 0.15 over the best ensemble member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
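To make the reported metric concrete, the following is a minimal sketch of how quadratic weighted kappa can be computed between expert grades and model grades, together with one simple way to ensemble several graders. It is not the authors' implementation. The 0 to 18 point scale, the function names, and the example scores are assumptions made for illustration, and the abstract does not say how the ensemble aggregates its members, so the median used here is only one plausible choice.

```python
from statistics import median_low


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two equally long lists of integer scores.

    kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), with w_ij = (i - j)^2.
    The usual normalisation of w_ij by (k - 1)^2 cancels in the ratio,
    so it is omitted here.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    for s in list(rater_a) + list(rater_b):
        if not min_score <= s <= max_score:
            raise ValueError(f"score {s} outside [{min_score}, {max_score}]")

    k = max_score - min_score + 1
    n = len(rater_a)

    # Observed joint counts: observed[i][j] = answers scored i by A and j by B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal histograms of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += w * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters gave the same single score to every answer.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def ensemble_scores(per_model_scores):
    """Combine several models by taking the median score per answer.

    ASSUMPTION: median aggregation is illustrative; the source does not
    specify the ensembling rule. median_low always returns one of the
    submitted integer scores, also for an even number of models.
    """
    return [median_low(column) for column in zip(*per_model_scores)]


# Hypothetical scores on an assumed 0-18 point scale.
expert = [4, 7, 9, 12, 15]
model_scores = [
    [5, 7, 8, 11, 16],
    [3, 9, 9, 13, 12],
    [4, 6, 11, 12, 15],
]
ensemble = ensemble_scores(model_scores)
ensemble_qwk = quadratic_weighted_kappa(expert, ensemble, 0, 18)
```

Because the weights grow with the squared distance between two grades, a model that is off by one point is penalised far less than one that is off by five, which fits an ordinal grading scale better than unweighted agreement does.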