Automatic Legal Writing Evaluation of LLMs

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Legal writing assessment lacks publicly available, dynamically updated benchmarks with fine-grained scoring criteria. Method: This paper introduces OAB-Bench—the first open-source evaluation benchmark for Brazil’s Bar Examination (OAB)—comprising 105 authentic exam questions, structured rubrics, and human grading rationales. It pioneers the conversion of a real-world judicial examination system into a reproducible legal writing evaluation framework. Zero-shot generation and automated scoring experiments are conducted using state-of-the-art models (e.g., Claude-3.5 Sonnet, OpenAI o1), with reliability validated via correlation analysis and alignment with official scoring standards. Results: Claude-3.5 Sonnet achieves a mean score of 7.93/10 across all subjects, meeting passing thresholds; o1 demonstrates strong agreement (r > 0.8) with human graders on passed responses, confirming LLMs’ viability as high-fidelity automated assessors. This work establishes the first domain-specific benchmark for legal writing evaluation, advancing professional writing assessment through a rigorous, scalable paradigm.

Technology Category

Application Category

📝 Abstract

Despite the recent advances in Large Language Models, benchmarks for evaluating legal writing remain scarce due to the inherent complexity of assessing open-ended responses in this domain. One of the key challenges in evaluating language models on domain-specific tasks is finding test datasets that are public, frequently updated, and contain comprehensive evaluation guidelines. The Brazilian Bar Examination meets these requirements. We introduce oab-bench, a benchmark comprising 105 questions across seven areas of law from recent editions of the exam. The benchmark includes comprehensive evaluation guidelines and reference materials used by human examiners to ensure consistent grading. We evaluate the performance of four LLMs on oab-bench, finding that Claude-3.5 Sonnet achieves the best results with an average score of 7.93 out of 10, passing all 21 exams. We also investigated whether LLMs can serve as reliable automated judges for evaluating legal writing. Our experiments show that frontier models like OpenAI's o1 achieve a strong correlation with human scores when evaluating approved exams, suggesting their potential as reliable automated evaluators despite the inherently subjective nature of legal writing assessment. The source code and the benchmark -- containing questions, evaluation guidelines, model-generated responses, and their respective automated evaluations -- are publicly available.

Problem

Research questions and friction points this paper is trying to address.

Evaluating legal writing in LLMs lacks benchmarks due to complexity

Finding updated, public test datasets for domain-specific tasks is challenging

Assessing reliability of LLMs as automated judges for legal writing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces oab-bench with 105 legal questions

Evaluates LLMs using Brazilian Bar Examination data

Tests LLMs as automated legal writing evaluators

🔎 Similar Papers

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval