Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical reliability issues in large language models (LLMs) performing 3-ply legal argumentation: pervasive hallucination, insufficient utilization of legal factors, and failure to abstain when instructed. To this end, the authors introduce an automated evaluation pipeline for this task. Methodologically, they formally define and quantify two failure modes, "factor-level hallucination" and "instruction-driven refusal," and propose a consistency-verification approach grounded in external LLM-based factor extraction and ground-truth alignment, supporting three progressively challenging test categories. Experiments across eight state-of-the-art LLMs reveal over 90% accuracy in hallucination control but consistently weak factor utilization; notably, most models still generate spurious arguments under refusal prompts, exposing fundamental trustworthiness gaps. The contribution is a scalable, reproducible benchmark framework with new metrics for assessing reliability in legal AI systems.

📝 Abstract
Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1&2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Project page: https://github.com/lizhang-AIandLaw/Measuring-Faithfulness-and-Abstention.
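The core comparison described in the abstract, extracted factors versus the ground-truth factors of the input case triple, reduces to set operations. The sketch below is a minimal illustration of that logic; the function name, factor labels, and the exact utilization metric are assumptions for illustration, not the paper's implementation:

```python
def evaluate_argument(extracted: set[str], ground_truth: set[str]) -> dict:
    """Compare factors extracted from a generated argument (by an external
    LLM) against the ground-truth factors of the input case triple."""
    hallucinated = extracted - ground_truth   # factors absent from the input materials
    utilized = extracted & ground_truth       # relevant factors the argument actually used
    missed = ground_truth - extracted         # relevant factors left unused
    return {
        "faithful": len(hallucinated) == 0,   # faithfulness = no hallucinated factors
        "utilization": len(utilized) / len(ground_truth) if ground_truth else 1.0,
        "hallucinated": hallucinated,
        "missed": missed,
    }

# Illustrative run: the model used F1 and F2 but invented F9 and missed F3.
report = evaluate_argument({"F1", "F2", "F9"}, {"F1", "F2", "F3"})
```

Under this framing, the paper's headline findings map to `faithful` being true for most models on Tests 1 and 2, while `utilization` stays well below 1.0.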
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-generated legal arguments for faithfulness and abstention.
Assessing hallucination and factor utilization in LLM legal outputs.
Testing LLM ability to abstain when no factual basis exists.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline evaluates LLM legal argument generation.
External LLM extracts and compares argument factors.
Tests include standard, swapped-precedent, and abstention scenarios.
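The abstention scenario (Test 3) hinges on two checks: deciding that no argument is viable because the current case shares no factors with either precedent, and detecting whether the model actually refused. A minimal sketch, where the refusal-phrase list is purely illustrative and not the paper's actual detection criterion:

```python
def should_abstain(current: set[str], precedent_p: set[str], precedent_d: set[str]) -> bool:
    """Test 3 condition: if the current case shares no factors with either
    precedent, no 3-ply argument can be grounded and the model must abstain."""
    return not (current & precedent_p) and not (current & precedent_d)

def abstained(response: str) -> bool:
    """Crude refusal detector over the model's output; the marker phrases
    are assumed for illustration."""
    markers = ("cannot generate", "no argument can be made", "abstain")
    return any(m in response.lower() for m in markers)
```

The paper's finding is that on inputs where `should_abstain` holds, most models nonetheless produce a spurious argument, i.e. `abstained` would be false.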
Li Zhang
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, USA

Morgan A. Gray
University of Pittsburgh
Artificial Intelligence and Law, Natural Language Processing, Empirical Legal Studies

Jaromír Šavelka
Carnegie Mellon University
NLP, Information Retrieval, Data Science, Artificial Intelligence and Law, Computing Education

Kevin D. Ashley
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, Pennsylvania, USA