🤖 AI Summary
This work addresses the “silent scope omission” (SSO) problem in legal and policy texts, in which a model applies a general rule but silently drops nested exceptions or counter-exceptions. To tackle this, the authors propose Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that explicitly models which clause overrides which and anchors every logical branch to source text spans. SG-DT requires explicit exclusion guards, which enables deterministic compilation and audit. Building on this representation, the study introduces NormBench, a benchmark of 2,290 provisions spanning Chinese, English, and cross-lingual settings. Evaluations on it reveal two pathologies in frontier large language models: performance that drops sharply as defeater depth increases (Recursion Decay), and a tendency to retrieve relevant spans yet fail to assemble correct control flow (the Auditability Trap). Experiments show that using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery. Downstream gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
📝 Abstract
Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
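To make the idea of a span-grounded tree with explicit guards more concrete, the following is a minimal illustrative sketch, not the authors' actual SG-DT schema. The class `Node`, the helper `span_of`, the function `evaluate`, the example provision in `TEXT`, and the fact keys `small` and `high_risk` are all hypothetical names chosen for this illustration. Each node stores the character span of its clause, the guard under which it applies, and the clauses that can defeat it, so that evaluation is deterministic and returns an audit trail of source spans.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Facts = Dict[str, bool]
Span = Tuple[int, int]

# Hypothetical provision: a rule, an exception, and a counter-exception.
TEXT = ("Controllers must keep records, unless they employ fewer than "
        "250 persons, except where processing is high-risk.")


def span_of(phrase: str) -> Span:
    """Locate a clause in TEXT and return its (start, end) character offsets."""
    start = TEXT.index(phrase)
    return (start, start + len(phrase))


@dataclass
class Node:
    span: Span                       # anchor into the source text
    effect: str                      # outcome if this clause is the last to apply
    guard: Callable[[Facts], bool]   # explicit condition for this clause
    defeaters: List["Node"] = field(default_factory=list)


def evaluate(node: Node, facts: Facts) -> Optional[Tuple[str, List[Span]]]:
    """Return (effect, audit trail of spans), or None if the guard fails."""
    if not node.guard(facts):
        return None
    for defeater in node.defeaters:
        result = evaluate(defeater, facts)
        if result is not None:       # an applicable defeater overrides this node
            effect, trail = result
            return effect, [node.span] + trail
    return node.effect, [node.span]


counter = Node(span_of("except where processing is high-risk"),
               "obligated", lambda f: f["high_risk"])
exception = Node(span_of("unless they employ fewer than 250 persons"),
                 "exempt", lambda f: f["small"], [counter])
rule = Node(span_of("Controllers must keep records"),
            "obligated", lambda f: True, [exception])

effect, trail = evaluate(rule, {"small": True, "high_risk": True})
print(effect, [TEXT[s:e] for s, e in trail])
```

In this sketch, a reader that drops the counter-exception would return `exempt` for a small controller doing high-risk processing, which is the kind of silent scope omission the abstract describes. The returned trail lists the spans of every clause that fired, so the decision can be checked against the source text.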