VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

Published: 2025-07-26
AI Summary
High-quality, domain-specific annotated corpora are scarce for low-resource languages such as Vietnamese, and the legal domain is no exception. This paper introduces VLQA, the first large-scale, domain-specialized Vietnamese legal question answering dataset, covering statutory provisions, judicial precedents, and practical Q&A. VLQA is built with a rigorous annotation pipeline that combines expert human labeling with statistical validation, so that question-answer pairs are accurate and faithful to their supporting passages. The dataset undergoes a multi-dimensional quality assessment and is validated empirically: models fine-tuned on VLQA substantially outperform zero-shot baselines, by up to 28.6 F1 points on Vietnamese legal NLP benchmarks for both question answering and legal information retrieval. The work fills a gap in low-resource legal NLP resources and provides a basis for supervised model training, robust evaluation, and cross-lingual legal AI research.
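
The summary reports gains in F1 points but does not say how F1 is computed. As a minimal sketch, assuming a token-overlap F1 of the kind commonly used for extractive question answering (this may not be the metric the authors use), the score for one predicted answer could be computed as follows. The function name `token_f1` and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. This is a simplification, not the paper's stated method.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection: a shared token counts at most as many
    # times as it occurs in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answers, for illustration only.
    print(token_f1("a b c d", "a b"))
```

Note that whitespace splitting separates Vietnamese text into syllables rather than words, since many Vietnamese words span several space-separated syllables. A dedicated word segmenter would change the token counts and therefore the score.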

Abstract
The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.

Problem

Research questions and friction points this paper is trying to address.

Lack of Vietnamese legal datasets for NLP tasks
Need for high-quality labeled legal corpora in low-resource languages
Difficulty of automating legal tasks with AI, given that legal systems are domain-specific and vary across countries and languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLQA dataset for Vietnamese legal domain
Statistical analysis of dataset quality
Evaluation with state-of-the-art models on legal information retrieval and question answering

Authors

Tan-Minh Nguyen
Japan Advanced Institute of Science and Technology, Japan
Hoang-Trung Nguyen
VNU University of Engineering and Technology, Vietnam
Trong-Khoi Dao
VNU University of Law, Vietnam
Xuan-Hieu Phan
Assoc. Prof. at UET, Vietnam National University, Hanoi
Data Mining, AI and NLP, Business Analytics
Ha-Thanh Nguyen
National Institute of Informatics, Japan
Thi-Hai-Yen Vuong
VNU University of Engineering and Technology, Vietnam National University, Hanoi
Data Mining, NLP, Legal NLP, Symbolic AI