LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study empirically evaluates, for the first time, whether leading large language models (LLMs) — GPT-4.1, Claude 4 Sonnet, and Bielik-11B-v2.6 — can pass the qualifying examination for membership in Poland's National Appeal Chamber, which combines a multiple-choice knowledge test on public procurement law with written judgment drafting. Method: the models were tested in a closed-book setting and in several retrieval-augmented generation (RAG) configurations built on a hybrid legal information retrieval and extraction pipeline. Contribution/Results: the models reached satisfactory scores on the knowledge test, but none met the passing threshold on judgment writing, with systematic deficiencies in legal reasoning, citation of legal provisions, and logical argumentation. Moreover, LLM-based automated scoring diverged significantly from the official examining committee's evaluations. The findings expose critical limitations of current LLMs in high-stakes judicial tasks, underscore the necessity of human-AI collaborative evaluation, and provide novel empirical evidence for benchmarking legal AI capabilities and refining evaluation paradigms.

📝 Abstract
This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information retrieval and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet, and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the 'LLM-as-a-judge' evaluations often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
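The retrieval-augmented setup described above can be illustrated with a minimal sketch: retrieve the statute snippets most similar to the exam question, then splice them into the prompt. This is not the paper's actual pipeline; the corpus text, article labels, and function names below are all hypothetical, and a toy term-frequency cosine retriever stands in for whatever retrieval architecture the authors used.

```python
# Minimal RAG sketch for a legal exam question: rank statute snippets by
# bag-of-words cosine similarity, then build an augmented prompt.
# Corpus contents and article numbers are illustrative, not real provisions.
from collections import Counter
import math

CORPUS = {
    "art_513": "The appeal shall specify the contested act of the contracting authority.",
    "art_528": "The Chamber shall reject an appeal filed after the deadline.",
    "art_554": "The Chamber shall uphold the appeal if a breach of the Act is found.",
}

def _tf(text):
    """Term-frequency vector of a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def retrieve(question, corpus, k=1):
    """Return the k snippets most similar to the question (cosine over TF)."""
    q = _tf(question)
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    def score(text):
        d = _tf(text)
        dot = sum(q[t] * d[t] for t in q)
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

    ranked = sorted(corpus.items(), key=lambda kv: score(kv[1]), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus):
    """Assemble a prompt that grounds the model in the retrieved provisions."""
    context = "\n".join(f"[{ref}] {text}" for ref, text in retrieve(question, corpus, k=2))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer citing the provisions above."

prompt = build_prompt("When shall the Chamber reject an appeal?", CORPUS)
```

A real system would replace the toy retriever with dense or hybrid retrieval over the full public procurement statutes and send `prompt` to the candidate model; the sketch only shows how retrieved context is attached to the question.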
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LLMs can pass Poland's qualifying exam for National Appeal Chamber membership
Testing automated evaluation of legal answers with the LLM-as-a-judge methodology
Identifying limitations in the legal reasoning and citation accuracy of current LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid information retrieval and extraction pipeline
Retrieval-Augmented Generation configurations for legal exam tasks
Automated evaluation via the LLM-as-a-judge approach