LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

📅 2026-05-13

🤖 AI Summary
This study examines whether LLMs can automate the annotation of credibility assessments in Danish asylum decision texts, a task complicated by the underrepresented language, the specialized legal domain, and class definitions that require subtle expert understanding. The authors introduce RAB-Cred, a Danish text classification dataset with expert annotations plus metadata such as annotator confidence and asylum case outcome, and benchmark 21 open-weight large language models (LLMs) with 30 system-user prompt combinations in zero-shot and few-shot settings. For the top-performing models and prompts, they analyze error consistency across LLMs, inter-class confusion, correlation with human confidence, and the sample-wise difficulty and severity of mistakes. The results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but also show that LLM annotators are imperfect and inconsistent, so conclusions should not rest on the predictions of a single, arbitrarily chosen model.
📝 Abstract
Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred
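As a rough illustration of two of the analyses named in the abstract (inter-class confusion and error consistency across LLMs), the sketch below computes a confusion matrix and a simple overlap score between the error sets of two annotators. It is not the authors' code: the three-way label set, the toy predictions, the function names, and the choice of Jaccard overlap as a consistency measure are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical label set: the abstract only says the task covers the presence
# and sentiment of credibility assessments, not the exact class names.
LABELS = ["none", "positive", "negative"]


def confusion_matrix(gold, pred, labels=LABELS):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def error_overlap(gold, pred_a, pred_b):
    """Jaccard overlap between the sets of samples each model gets wrong.

    An assumed stand-in for an error-consistency measure: 1.0 means both
    models fail on exactly the same samples, 0.0 means disjoint errors.
    """
    errs_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    errs_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    union = errs_a | errs_b
    if not union:
        return 1.0  # neither model makes an error, so the error sets are identical
    return len(errs_a & errs_b) / len(union)


# Toy data, invented for illustration only.
gold = ["none", "positive", "negative", "negative", "positive", "none"]
model_a = ["none", "positive", "positive", "negative", "none", "none"]
model_b = ["none", "negative", "positive", "negative", "positive", "none"]

matrix_a = confusion_matrix(gold, model_a)
overlap = error_overlap(gold, model_a, model_b)
print(matrix_a)
print(round(overlap, 3))
```

In this toy case both models reach the same accuracy but share only one of their three distinct errors, which is the kind of disagreement that an aggregated metric alone would hide.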
Problem

Research questions and friction points this paper is trying to address.

credibility assessment
asylum decisions
legal NLP
underrepresented languages
text classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
credibility assessment
legal NLP
low-resource languages
annotation error analysis
Galadrielle Humblot-Renaux
Visual Analysis and Perception Lab, Aalborg University
deep learning, computer vision, uncertainty, out-of-distribution detection
Mohammad N. S. Jahromi
Visual Analysis and Perception Lab, Aalborg University; Center of Excellence for Global Mobility Law, University of Copenhagen; Pioneer Center for AI, Denmark
Rohat Bakuri-Jørgensen
Visual Analysis and Perception Lab, Aalborg University
Marieke Anne Heyl
Center of Excellence for Global Mobility Law, University of Copenhagen
Asta S. Stage Jarlner
Center of Excellence for Global Mobility Law, University of Copenhagen
Maria Vlachou
Department of Computer Science, University of Copenhagen
Anna Murphy Høgenhaug
Center of Excellence for Global Mobility Law, University of Copenhagen
Desmond Elliott
Associate Professor, University of Copenhagen
Natural Language Processing, Vision-Language, Tokenization-free Language Models
Thomas Gammeltoft-Hansen
Professor of migration and mobility law; Director, Center of Excellence for Global Mobility Law
human mobility, migration/refugees, international law, AI & law, legal theory
Thomas B. Moeslund
Visual Analysis and Perception Lab, Aalborg University; Pioneer Center for AI, Denmark