🤖 AI Summary
Traditional legal case retrieval relies on surface-level lexical matching, which fails to capture similarity in legal reasoning between precedents. This work proposes a two-stage, section-aware retrieval framework. The first stage combines BM25 with dense vector retrieval to generate a high-recall candidate set. The second stage performs fine-grained alignment across structured sections (facts, disputed issues, rulings, and reasoning) and fuses the resulting signals using query-level Z-score normalization and learned section weights. With section-level alignment and query-adaptive normalization, the method shows consistent gains over strong lexical and neural baselines on a jurisdiction-scale benchmark, improving precision in analogous case matching without sacrificing candidate coverage. It also returns interpretable, section-level justifications for its retrieval results.
📝 Abstract
Finding truly analogous precedents requires capturing legal reasoning beyond surface word overlap. We present a two-stage, section-aware framework for legal case retrieval that first segments raw judgments offline into facts, issues, decision, and reasoning using a deterministic large language model (LLM). In Stage 1, we run lexical (BM25) and semantic (dense ANN) whole-document searches in parallel and combine them via Reciprocal Rank Fusion (RRF) to form a high-recall candidate pool. In Stage 2, we perform fine-grained, like-for-like comparisons (e.g., query reasoning vs. candidate reasoning). To address the scale mismatch between unbounded lexical scores and cosine similarities, we apply query-wise Z-score normalization before aggregating signals with learned section weights. For the top results, the system returns the relevant section text with a concise, grounded rationale and party-stance labels. We evaluate on a jurisdiction-scale benchmark, demonstrating consistent gains over strong lexical and neural baselines while maintaining high candidate coverage.
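The two fusion steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the RRF constant `k=60` (a commonly used default), the population standard deviation in the Z-score, and the hand-set section weights are all assumptions made for illustration. In the paper the section weights are learned, and the exact aggregation formula is not given in the abstract.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Stage 1 sketch: fuse ranked lists of document ids with RRF.

    Each document receives sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def zscore(values):
    """Query-wise Z-score over one signal's scores for a single query."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        # All candidates scored the same, so the signal carries no ranking information.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fuse_section_scores(candidates, section_scores, weights):
    """Stage 2 sketch: normalize each (section, signal) list, then take a weighted sum.

    section_scores maps section -> signal name -> list of raw scores,
    aligned with `candidates`. `weights` maps section -> weight
    (hypothetical hand-set values here; learned in the paper).
    """
    totals = [0.0] * len(candidates)
    for section, signals in section_scores.items():
        for raw in signals.values():
            for i, z in enumerate(zscore(raw)):
                totals[i] += weights[section] * z
    return sorted(zip(candidates, totals), key=lambda pair: -pair[1])


# Toy usage with made-up ids and scores.
pool = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
ranked = fuse_section_scores(
    ["x", "y"],
    {
        "facts": {"bm25": [10.0, 2.0], "cosine": [0.9, 0.1]},
        "reasoning": {"bm25": [1.0, 3.0], "cosine": [0.2, 0.8]},
    },
    {"facts": 0.3, "reasoning": 0.7},
)
```

Normalizing per query puts the unbounded BM25 scores and the bounded cosine similarities on a comparable scale before weighting, so neither signal dominates the sum merely because of its range. In the toy example, candidate `y` ranks first because it wins on the more heavily weighted reasoning section.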