Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit

📅 2026-06-11
🤖 AI Summary
This study addresses a limitation of existing Chinese legal case retrieval benchmarks: relevance labels track charge matching so closely that benchmark scores may misrepresent a model's legal reasoning capability. Through a cross-benchmark audit, we show the dominant influence of charge labels on retrieval metrics and propose a reusable Charge-Controlled Evaluation (CCE) protocol, which packages established construct-validity and partial-input checks to quantify charge as a confound. The empirical analysis combines shared-primary-charge ranking with BM25 tie-breaking, BM25 baselines, a predicted-charge cascade, a zero-training charge-pool channel, and cluster-bootstrap confidence intervals. On LeCaRDv2, ranking by shared primary charge with BM25 tie-breaking closes 99.2% of the gap between BM25 and the best-trained system; the same rule recovers 84.3% on LeCaRDv1 and is out of spec on CAIL2022. We release scripts, schemas, and the evaluation protocol to support reproducible future benchmarking.
๐Ÿ“ Abstract
Chinese Legal Case Retrieval (LCR) benchmarks grade a reference judgment relevant when its legal characterization matches the query, and strong systems now reach NDCG@10 of 0.85-0.88. Most of the BM25-to-best-trained gap is recoverable with no retrieval model: ranking candidates only by shared primary charge, broken by BM25, closes 99.2% of it on LeCaRDv2 -- with no detectable difference from the best-trained system. This reflects benchmark design: LeCaRDv2 defines top relevance via the crime's key constitutive elements, which encode the charge, so same-charge cases are relevant by construction (relevance lift 4.49; charge-to-relevance macro-AUC 0.871). Holding charge fixed, the trained reranker's advantage over BM25 collapses to a small within-charge residual (+0.026 NDCG@10, cluster-bootstrap CI excluding zero, about a quarter), the only non-definitional positive. The effect is not uniform: the same rule recovers 84.3% on LeCaRDv1 and is out of spec on CAIL2022, with the charge-to-relevance signal weakening in step (macro-AUC 0.871/0.759/0.728); a predicted-charge cascade reproduces 76.6% on LeCaRDv2 but does not transfer. The construct is also cashable at first stage: an exploratory zero-training charge-pool channel lifts LeCaRDv2 recall (R@100 +0.025, wrong-charge controls hurt), reported as a positive control for the confound, not a retrieval method or novelty claim. Charge is thus a high-leverage construct-validity factor at the benchmark level -- not a uniform explanation of NDCG@10, and not evidence that any system relies on charge. We package established construct-validity and partial-input checks as a reusable charge-controlled protocol (CCE); on all three benchmarks its triggers come back null or descriptive, behaving as designed. We release the scripts, schema, and protocol so future benchmarks can be screened before their NDCG@10 is read as legal-reasoning ability.
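The charge-only rule described in the abstract (rank candidates by shared primary charge, break ties by BM25) and the "share of the gap closed" figure can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the candidate tuple layout, and the exact gap formula are assumptions made for illustration, and the numbers in the usage note are toy values, not results from the paper.

```python
from typing import List, Tuple

# Hypothetical candidate record: (doc_id, primary_charge, bm25_score).
Candidate = Tuple[str, str, float]


def charge_first_ranking(query_charge: str, candidates: List[Candidate]) -> List[str]:
    """Rank same-charge candidates first; order each group by descending BM25."""
    ordered = sorted(
        candidates,
        # False sorts before True, so charge matches come first;
        # the negated score puts higher BM25 earlier within a group.
        key=lambda c: (c[1] != query_charge, -c[2]),
    )
    return [doc_id for doc_id, _, _ in ordered]


def gap_recovered(baseline: float, rule: float, best: float) -> float:
    """Fraction of the baseline-to-best metric gap that the rule closes.

    Assumed definition: (rule - baseline) / (best - baseline).
    """
    if best == baseline:
        raise ValueError("baseline and best scores are equal; the gap is undefined")
    return (rule - baseline) / (best - baseline)
```

For example, with the toy candidates [("a", "theft", 3.0), ("b", "fraud", 9.0), ("c", "theft", 5.0)] and a query whose primary charge is "theft", the rule ranks "c" and "a" ahead of "b" even though "b" has the highest BM25 score. With toy metric values of 0.2 (baseline), 0.5 (rule), and 0.6 (best), the rule recovers 75% of the gap.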
Problem

Research questions and friction points this paper is trying to address.

charge
construct validity
legal case retrieval
benchmarking
relevance evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

charge-controlled evaluation
construct validity
legal case retrieval
benchmark auditing
relevance confounding