GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

📅 2025-05-22
🤖 AI Summary
This study addresses the challenge of evaluating large language models' (LLMs) legal reasoning and statutory/case-law citation capabilities on the Greek Bar Examination, a task that previously lacked a dedicated benchmark. The authors introduce the first free-text legal reasoning benchmark for this examination, covering five core domains: constitutional, civil, criminal, administrative, and procedural law. Methodologically, they propose a three-dimensional scoring scheme and an "LLM-as-a-judge" automated evaluation framework, validated via a meta-evaluation benchmark that measures how consistently the judges score. They find that simple, span-based scoring rubrics significantly improve agreement between LLM judges and human experts (Cohen's κ = 0.82). The system integrates citation localization, multi-dimensional automated scoring, and calibration mechanisms to evaluate 13 state-of-the-art LLMs. The top-performing model exceeds the mean human expert score but falls short of the 95th percentile of experts, revealing persistent limitations in rigorous legal citation and logically complete reasoning.
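The judge-human agreement figure above is a Cohen's κ score, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch of the computation, using made-up rubric scores (the `human` and `llm` lists are hypothetical, not data from the paper):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two raters assigning categorical labels.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the chance agreement implied by each rater's
    marginal label frequencies.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical per-answer rubric scores (0-3) from a human expert
# and an LLM judge on ten exam answers.
human = [3, 2, 2, 1, 3, 0, 2, 3, 1, 2]
llm   = [3, 2, 1, 1, 3, 0, 2, 3, 2, 2]
print(round(cohens_kappa(human, llm), 2))  # → 0.71
```

A κ above 0.8, as reported for the span-based rubrics, is conventionally read as near-perfect agreement, which is why the rubric simplification matters for trusting the automated judge.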

📝 Abstract
We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on Greek Bar exam legal questions with citations
Proposes 3D scoring and LLM-as-judge for free-text evaluation
Assesses LLM-judge and human expert evaluation alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-dimensional scoring system for evaluation
LLM-as-a-judge approach for assessment
Meta-evaluation benchmark for alignment improvement