Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation frameworks for legal reliability lack systematic benchmarks targeting the subtle, adversarial, and latent errors found in real-world contracts. Method: We introduce CLAUSE, the first stress-testing benchmark explicitly designed to assess the fragility of legal reasoning, comprising 10 categories of persona-driven anomalous contracts that preserve legal fidelity. Built on CUAD and ContractNLI, it contains over 7,500 perturbed contracts; fine-grained discrepancy detection and attribution are supported by an integrated Retrieval-Augmented Generation (RAG) system that verifies anomalies against statutory provisions. Contribution/Results: Experiments reveal that state-of-the-art LLMs exhibit significant deficiencies both in detecting nuanced legal deviations and in generating legally sound explanations, exposing fundamental reasoning flaws in high-stakes legal tasks. CLAUSE fills a critical gap in legal AI robustness evaluation and establishes a reproducible, scalable assessment paradigm for trustworthy legal LLMs.

📝 Abstract
The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7,500 perturbed versions of real-world contracts drawn from foundational datasets such as CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs' ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.
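
To make the evaluation task concrete, here is a minimal sketch of how an LLM could be scored on CLAUSE-style items: the model is shown a (possibly perturbed) contract, asked for a YES/NO discrepancy verdict plus a legal justification, and the verdict is compared against the gold label. The ClauseItem fields, the prompt wording, and the generic llm callable are illustrative assumptions, not the paper's released evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ClauseItem:
    """One benchmark-style instance: a (possibly perturbed) contract and its gold label."""
    contract_text: str
    has_discrepancy: bool            # gold: does the contract contain an injected anomaly?
    anomaly_category: Optional[str]  # gold: one of the 10 categories, or None if clean
    gold_rationale: Optional[str]    # gold: reference legal justification

DETECT_PROMPT = (
    "You are a contract auditor. Read the contract below.\n"
    "1. State whether it contains a legal discrepancy (answer YES or NO first).\n"
    "2. If YES, quote the problematic clause and explain, citing the relevant "
    "statutory or contractual principle, why it is legally unsound.\n\n"
    "Contract:\n{contract}"
)

def evaluate_detection(items: List[ClauseItem],
                       llm: Callable[[str], str]) -> float:
    """Return the fraction of items where the model's YES/NO verdict matches the gold label.

    `llm` is any text-in/text-out completion function (e.g. a thin wrapper around a
    chat API). The free-text justification is not scored here.
    """
    correct = 0
    for item in items:
        answer = llm(DETECT_PROMPT.format(contract=item.contract_text))
        predicted_yes = answer.strip().upper().startswith("YES")
        correct += int(predicted_yes == item.has_discrepancy)
    return correct / len(items) if items else 0.0
```

In practice the justification would be scored separately, e.g. against the gold rationale by annotators or an LLM judge, since the paper reports that models struggle even more with legal justification than with detection.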
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs' legal reasoning reliability against real-world contract flaws
Evaluating models' ability to detect fine-grained discrepancies in contracts
Addressing LLMs' weakness in identifying and legally justifying subtle errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLAUSE benchmark evaluates LLM legal reasoning fragility
Persona-driven pipeline generates 10 anomaly contract categories
RAG system validates anomalies against official legal statutes (a sketch of this generation-and-verification pipeline follows below)
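
The sketch below shows one way the persona-driven anomaly injection and the RAG-based statutory check could be wired together: a generator model role-plays a persona to perturb a clean clause into a given anomaly category, then retrieved statutory excerpts are handed to a verifier model that accepts or rejects the perturbation. The keyword-overlap retriever, the persona and prompt wording, the Statute type, and the VALID/INVALID verdict format are assumptions made for illustration; the paper's actual pipeline and prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Statute:
    citation: str  # e.g. an illustrative statutory citation
    text: str

def retrieve_statutes(query: str, corpus: List[Statute], k: int = 3) -> List[Statute]:
    """Toy retriever: rank statutes by keyword overlap with the query.
    A real pipeline would use dense retrieval over an official statutory corpus."""
    q_tokens = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda s: len(q_tokens & set(s.text.lower().split())),
                    reverse=True)
    return scored[:k]

def inject_anomaly(clause: str, persona: str, category: str,
                   llm: Callable[[str], str]) -> str:
    """Ask a generator model, role-playing `persona`, to rewrite the clause so it
    contains a discrepancy of the given category while staying superficially plausible."""
    prompt = (
        f"Act as {persona}. Rewrite the clause below so that it contains a subtle "
        f"'{category}' discrepancy while keeping the wording natural:\n\n{clause}"
    )
    return llm(prompt)

def verify_with_rag(perturbed_clause: str, corpus: List[Statute],
                    llm: Callable[[str], str]) -> bool:
    """RAG-style check: retrieve candidate statutes and ask a verifier model whether
    the injected discrepancy genuinely conflicts with the retrieved law."""
    evidence = retrieve_statutes(perturbed_clause, corpus)
    context = "\n\n".join(f"[{s.citation}] {s.text}" for s in evidence)
    verdict = llm(
        "Given the statutory excerpts below, does the clause contain a genuine legal "
        "discrepancy? Answer VALID or INVALID.\n\n"
        f"Statutes:\n{context}\n\nClause:\n{perturbed_clause}"
    )
    return verdict.strip().upper().startswith("VALID")
```

Only perturbations that pass the verification step would be kept, which is how a pipeline of this shape can preserve legal fidelity while still embedding subtle flaws.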