🤖 AI Summary
A systematic evaluation of domain-specific large language models (LLMs) for legal text understanding, particularly contract classification, has so far been lacking. Method: This work presents the first comprehensive benchmark of ten legal-domain LLMs against seven general-purpose LLMs on three English contract understanding tasks, using a multi-task evaluation framework that emphasizes text classification and semantic understanding. Contribution/Results: Legal-specialized models significantly outperform general-purpose models, especially on tasks requiring fine-grained legal reasoning. Legal-BERT and Contracts-BERT achieve new state-of-the-art (SOTA) results on two of the three tasks despite their relatively small parameter counts, and CaseLaw-BERT and LexLM provide strong additional baselines. Collectively, this study establishes a much-needed benchmark and offers empirically grounded guidance for model selection in contract understanding systems, advancing the development of precise, task-adapted legal AI.
📝 Abstract
Despite advances in legal NLP, no comprehensive evaluation of multiple legal-specific LLMs currently exists for contract classification tasks within contract understanding. To address this gap, we evaluate 10 legal-specific LLMs on three English-language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new state-of-the-art (SOTA) results on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.