KoBLEX: Open Legal Question Answering with Multi-hop Reasoning

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing legal LLM benchmarks lack systematic evaluation of open-ended, provision-grounded multi-hop legal question answering (QA). This work introduces KoBLEX, the first high-quality Korean legal QA benchmark explicitly designed for multi-hop reasoning. We propose ParSeR, a novel method that integrates parametric provision generation with a three-stage sequential retrieval process to enable precise, provision-guided reasoning. Additionally, we introduce LF-Eval, an automated evaluation metric that quantifies the legal fidelity of answers and achieves strong agreement with human judgments (ρ = 0.92). Experiments demonstrate that ParSeR significantly outperforms strong baselines across multiple large language models: compared with standard retrieval using GPT-4o, it improves F1 by 37.91 points and LF-Eval by 30.81 points, while maintaining robust performance across varying reasoning depths.
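
To make the ParSeR idea concrete, here is a minimal sketch of how parametric provision generation and sequential retrieval could be composed. Everything here is an illustrative assumption, not the paper's implementation: the name `parser_answer`, the `llm` and `retrieve` callables, and the prompts are invented, and "three-stage sequential retrieval" is interpreted as up to three sequential retrieval hops, which may differ from the paper's actual staging.

```python
from typing import Callable, List

def parser_answer(
    question: str,
    llm: Callable[[str], str],                  # prompt -> completion
    retrieve: Callable[[str, int], List[str]],  # (query, k) -> provisions
    max_hops: int = 3,                          # assumed: three sequential hops
) -> str:
    evidence: List[str] = []
    for _ in range(max_hops):
        # The LLM drafts a "parametric provision": its own guess at the
        # statute text needed for the next reasoning step.
        parametric = llm(
            f"Question: {question}\n"
            f"Provisions found so far: {' | '.join(evidence) or 'none'}\n"
            "Draft the statute provision needed for the next reasoning step:"
        )
        # Retrieve real provisions from the legal corpus, using the
        # parametric draft (not the raw question) as the query.
        candidates = retrieve(parametric, 5)
        # Selection: the LLM picks the candidate that actually supports
        # the step, or signals that no further provision is needed.
        selected = llm(
            f"Question: {question}\n"
            "Candidates:\n" + "\n".join(candidates) + "\n"
            "Return the single most relevant provision, or NONE:"
        )
        if selected.strip() == "NONE":
            break
        evidence.append(selected)
    # Final answer grounded only in the retrieved provisions.
    return llm(
        "Answer the legal question using only these provisions.\n"
        "Provisions:\n" + "\n".join(evidence) + f"\nQuestion: {question}"
    )
```

The key design point this sketch tries to capture is that retrieval is guided by LLM-generated provision text rather than the user's question alone, which is what lets each hop target the next legally relevant clause.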

📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs' legal capabilities. However, these benchmarks fail to evaluate open-ended, provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of generated answers, we propose Legal Fidelity Evaluation (LF-Eval), an automatic metric that jointly considers the question, answer, and supporting provisions and shows high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 F1 and +30.81 LF-Eval. Further analyses reveal that ParSeR delivers consistent performance across reasoning depths, and ablations confirm its effectiveness.
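
As a rough illustration of what an LF-Eval-style judge could look like, the sketch below builds a single prompt that jointly presents the question, the generated answer, and the supporting provisions, and asks a judge LLM for a fidelity score; it then shows how such a metric would be validated against human ratings with Spearman's ρ. The function name `lf_eval_score`, the 1-5 rubric, the prompt wording, and the toy score lists are all assumptions; the paper's actual rubric and prompts are not reproduced here.

```python
from typing import Callable, List

def lf_eval_score(
    question: str,
    answer: str,
    provisions: List[str],
    llm: Callable[[str], str],  # prompt -> completion (judge LLM)
) -> float:
    """Score the legal fidelity of `answer` against its supporting provisions.

    Hypothetical rubric: 1-5, where 5 means every legal claim in the
    answer is supported by the cited provisions.
    """
    prompt = (
        "You are evaluating the legal fidelity of an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Supporting provisions:\n"
        + "\n".join(f"- {p}" for p in provisions)
        + "\nRate fidelity from 1 (unsupported) to 5 (fully supported). "
          "Reply with the number only."
    )
    return float(llm(prompt).strip())

# Validating such a metric against expert ratings (illustrative data only;
# the paper reports Spearman's rho = 0.92 against human judgments).
from scipy.stats import spearmanr

metric_scores = [5, 3, 4, 2, 5]
human_scores = [5, 2, 4, 2, 4]
rho, _ = spearmanr(metric_scores, human_scores)
```

Scoring the question, answer, and provisions jointly (rather than the answer in isolation) is what distinguishes a fidelity metric like this from generic answer-quality scoring.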
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-ended legal question answering with multi-hop reasoning
Addressing lack of provision-grounded benchmarks in legal AI
Improving legal fidelity and reliability of LLM-generated answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid LLM-human expert pipeline for benchmark creation (see the sketch after this list)
Parametric provision-guided Selection Retrieval method
Three-stage sequential retrieval for multi-hop reasoning
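
In outline, the hybrid LLM-human expert pipeline named in the first item could look like the loop below: an LLM drafts a scenario-based QA instance with candidate supporting provisions, and a legal expert then verifies, corrects, or rejects it before it enters the benchmark. Every name here (`QAInstance`, `draft_instance`, `expert_review`) is hypothetical; the paper's actual pipeline stages are not detailed in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class QAInstance:
    scenario: str
    question: str
    answer: str
    provisions: List[str]  # supporting statute provisions

def build_benchmark(
    scenarios: List[str],
    draft_instance: Callable[[str], QAInstance],  # LLM drafting step
    expert_review: Callable[[QAInstance], Optional[QAInstance]],  # human step
) -> List[QAInstance]:
    """Hybrid pipeline: the LLM drafts, a human expert validates or rejects."""
    benchmark: List[QAInstance] = []
    for scenario in scenarios:
        draft = draft_instance(scenario)  # LLM proposes QA + provisions
        verified = expert_review(draft)   # expert edits, or returns None to reject
        if verified is not None:
            benchmark.append(verified)
    return benchmark
```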
Jihyung Lee
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea

Daehui Kim
AI Future Lab, KT
Natural Language Processing, Large Language Model, Legal NLP

Seonjeong Hwang
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea

Hyounghun Kim
POSTECH
NLP, Multimodal Learning

Gary Lee
Graduate School of Artificial Intelligence, POSTECH, Republic of Korea