QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort

📅 2025-05-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high cost and time consumption of constructing domain-specific ranking datasets for Query-By-Document (QBD) retrieval, this paper proposes an automated, large language model (LLM)-driven data generation framework tailored to QBD. The method combines LLM-based re-ranking, controllable prompt engineering, and lightweight domain-expert feedback, producing interpretable relevance scores with explanations that support human-in-the-loop validation and substantially reduce annotation effort. The generated data is used to fine-tune the parameters of BM25 on QBD datasets from the Text Retrieval Conference (TREC), improving the model's ranking performance, and the approach targets domains such as patent matching, legal and compliance case retrieval, and academic literature review. The core contribution is a QBD-specific, LLM-based data generation process that balances domain expertise, output interpretability, and annotation efficiency.
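As a concrete illustration of the scoring step described in the summary, the following is a minimal, hypothetical sketch of prompting an LLM for a relevance score together with an explanation that a domain expert can review. The prompt wording, the 0-10 scale, the JSON output contract, and the `llm` callable are all assumptions for illustration, not the paper's exact design.

```python
import json
from typing import Callable

# Hypothetical prompt template; the paper's actual prompts are not shown here.
SCORING_PROMPT = """You are a domain expert in {domain}.

Query document:
{query_doc}

Candidate document:
{candidate_doc}

Rate the candidate's relevance to the query document on a 0-10 scale.
Respond as JSON: {{"score": <integer>, "explanation": "<one sentence>"}}"""

def score_candidate(llm: Callable[[str], str], domain: str,
                    query_doc: str, candidate_doc: str) -> dict:
    """Ask an LLM for a relevance score plus a rationale a human can review."""
    prompt = SCORING_PROMPT.format(domain=domain, query_doc=query_doc,
                                   candidate_doc=candidate_doc)
    result = json.loads(llm(prompt))  # expects {"score": ..., "explanation": ...}
    if not 0 <= result["score"] <= 10:
        raise ValueError(f"score out of range: {result['score']}")
    return result
```

Sorting candidates by the returned scores yields a ranked list per query document, and the explanations give reviewers something concrete to accept or correct.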

📝 Abstract
The Query-By-Document (QBD) problem is an information retrieval problem in which the query is itself a document, and the retrieved candidates are documents that match it, often in a domain- or query-specific manner. This is crucial for tasks such as patent matching, legal or compliance case retrieval, and academic literature review. Existing retrieval methods, including keyword search and document embeddings, can be optimized with domain-specific datasets to improve QBD search performance. However, creating these domain-specific datasets is often costly and time-consuming. Our work introduces a process, which we refer to as QBD-RankedDataGen, for generating custom QBD-search datasets, and compares a set of methods for this task. We provide a comparative analysis of the proposed methods in terms of cost, speed, and the interface they present to domain experts. The methods we compare leverage Large Language Models (LLMs), which can incorporate domain-expert input to produce document scores and rankings, as well as explanations for human review. The process and methods we present can significantly reduce the human effort of dataset creation for custom domains while still capturing sufficient expert knowledge for tuning retrieval models. We evaluate our methods on QBD datasets from the Text Retrieval Conference (TREC) and fine-tune the parameters of the BM25 model -- which is used in many industrial-strength search engines such as OpenSearch -- using the generated data.
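Since the abstract describes fine-tuning BM25's parameters with the generated data, the sketch below shows one plausible way to do that: a grid search over BM25's k1 and b, scored by NDCG@10 against LLM-generated graded judgments. It uses the open-source rank_bm25 package; the judgment format, grid values, and whitespace tokenization are assumptions, not the paper's reported setup.

```python
import math
from itertools import product

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def ndcg_at_k(gains, ideal_gains, k=10):
    """Normalized discounted cumulative gain over graded relevance scores."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    idcg = sum(g / math.log2(i + 2)
               for i, g in enumerate(sorted(ideal_gains, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def tune_bm25(corpus, queries, judgments,
              k1_grid=(0.5, 0.9, 1.2, 1.5, 2.0), b_grid=(0.3, 0.5, 0.75, 0.9)):
    """Grid-search BM25's k1/b. judgments[q_id][doc_idx] holds a graded score
    produced by the LLM pipeline (this format is an assumption)."""
    tokenized = [doc.split() for doc in corpus]
    best_params, best_ndcg = None, -1.0
    for k1, b in product(k1_grid, b_grid):
        bm25 = BM25Okapi(tokenized, k1=k1, b=b)
        per_query = []
        for q_id, query in queries.items():
            scores = bm25.get_scores(query.split())
            ranking = sorted(range(len(corpus)), key=lambda i: scores[i],
                             reverse=True)
            gains = [judgments[q_id].get(i, 0) for i in ranking]
            per_query.append(ndcg_at_k(gains, list(judgments[q_id].values())))
        mean_ndcg = sum(per_query) / len(per_query)
        if mean_ndcg > best_ndcg:
            best_params, best_ndcg = (k1, b), mean_ndcg
    return best_params, best_ndcg
```

OpenSearch exposes the same two knobs (k1 and b) on its BM25 similarity, so parameters found this way can be carried over to a production index.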
Problem

Research questions and friction points this paper is trying to address.

Generating custom datasets to improve Query-By-Document search
Reducing human effort in domain-specific dataset creation
Leveraging LLMs for document scoring and ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates custom QBD-search datasets efficiently
Leverages LLMs for document scoring and ranking
Reduces human effort with expert knowledge integration (see the sketch below)
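As a rough illustration of the expert-knowledge integration mentioned above, here is a hypothetical sketch of the human-in-the-loop validation step: the expert reads each LLM explanation and either accepts the score or overrides it. The record fields and console interaction are illustrative only, not the paper's interface.

```python
def review_judgments(judgments):
    """judgments: list of dicts with 'doc_id', 'score', and 'explanation' keys
    (a hypothetical record format, not the paper's)."""
    validated = []
    for j in judgments:
        print(f"doc {j['doc_id']}: score {j['score']} -- {j['explanation']}")
        answer = input("Accept? [Enter = yes, or type a new 0-10 score]: ")
        if answer.strip():
            j = {**j, "score": int(answer)}  # expert override
        validated.append(j)
    return validated
```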