AI Summary
This study addresses the need for cross-jurisdictional contract review in multinational enterprises, a need largely unmet by existing legal NLP research, which focuses predominantly on single jurisdictions and lacks systematic annotation of clause equivalence across common law systems. To bridge this gap, the authors introduce LAUKIN, the first multi-jurisdictional dataset of contract clause pairs spanning Australia, the United Kingdom, and India. Candidate pairs are generated through multi-stage retrieval and re-ranking, and legal experts then annotate 3,000 of the 14,727 pairs for legal equivalence. Across twelve models spanning four technical approaches, the best-performing model achieves a macro-F1 score of 65.11%, which indicates that drafting conventions differ substantially despite the shared common law tradition. The dataset and results establish a new benchmark, and the remaining 11,727 unlabelled pairs are provided to support semi-supervised legal NLP.
Abstract
Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
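For readers unfamiliar with the reported metric, the following is a minimal sketch of how macro-F1 can be computed for the two labels named in the abstract. The function names and the toy predictions are illustrative assumptions, not the authors' evaluation code; only the label names and the split sizes come from the abstract.

```python
from typing import Sequence

# Label names as given in the abstract.
LABELS = ("Equivalent", "Not Equivalent")

# Split sizes reported in the abstract; the checks confirm the totals.
TOTAL_PAIRS = 14_727
SPLITS = {"train": 900, "dev": 600, "test": 1_500}
assert sum(SPLITS.values()) == 3_000
assert TOTAL_PAIRS - sum(SPLITS.values()) == 11_727  # unlabelled pairs


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives means F1 is 0 (also avoids division by zero).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-label F1, so both classes count equally."""
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    # Toy data (hypothetical), not drawn from LAUKIN.
    E, N = LABELS
    gold = [E, E, N, N, N, N]
    pred = [E, N, N, N, N, E]
    print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Because macro-F1 averages the per-class scores without weighting by class frequency, a model that favours the majority label is penalised for poor performance on the minority label, which makes it a natural choice when the two classes may be imbalanced.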