🤖 AI Summary
This paper addresses the performance gap between lexical models (e.g., BM25) and semantic models (e.g., BERT-based dense retrievers) in legal information retrieval, focusing on the highly repetitive, formulaic, and terminologically constrained judgments of the Court of Justice of the European Union (CJEU). Methodologically, it conducts a systematic paragraph-level retrieval analysis, using three complementary metrics, of how textual repetitiveness relates to model effectiveness, and examines the trade-off between fine-tuning gains and temporal robustness as a function of training data scale. Results show that BM25 outperforms off-the-shelf dense models on 4 of 7 evaluation metrics; after domain-specific fine-tuning, however, dense retrievers surpass BM25 on most metrics, with the amount of training data significantly influencing both performance gains and temporal robustness. These findings offer practical guidance for developing accurate, efficient, and transparent retrieval systems for specific legal domains.
📝 Abstract
Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive usage of language, whereas BM25 performs better than the dense models in more nuanced scenarios where repetition and verbatim quotes are less prevalent, and in longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data led to improved performance, surpassing BM25 in most metrics, and we analyze the effect of the amount of fine-tuning data on the model's performance and temporal robustness. The code, dataset, and appendix related to this work are available at: https://github.com/larimo/lexsem-legal-ir.
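To make the lexical baseline concrete, below is a minimal sketch of BM25 scoring over a toy corpus of short, formulaic legal-style paragraphs. This is not the paper's implementation or dataset; the corpus, query, and parameter values (`k1`, `b`) are illustrative assumptions, using the standard Okapi BM25 formula.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with the Okapi BM25 formula."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N  # average document length
    # document frequency: number of documents containing each term
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)  # term frequencies within this document
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # term-frequency saturation (k1) and length normalization (b)
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

# Illustrative corpus of formulaic legal-style paragraphs (not from the paper).
docs = [
    "the court finds that the national measure infringes article 101".split(),
    "the applicant seeks annulment of the contested decision".split(),
    "pursuant to article 101 the agreement restricts competition".split(),
]
query = "article 101 competition".split()
scores = bm25_scores(query, docs)
best = scores.index(max(scores))  # the paragraph sharing the most query terms wins
```

Repetitive, boilerplate phrasing is exactly what makes this term-overlap scoring competitive in the legal domain: verbatim formulas reappear across judgments, so exact lexical matches carry strong relevance signal without any semantic modeling.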