Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training

📅 2025-07-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In professional-domain RAG-based question-answering systems, retrieval failure often arises from semantic misalignment between user queries and domain documents. To address this, we propose a query rewriting method based on domain-adaptive continual pretraining. Our core innovation is a “pre-exam review” mechanism: the rewriting model undergoes knowledge-injection continual pretraining on domain-specific documents—followed by supervised fine-tuning—to internalize domain terminology, syntactic patterns, and cross-modal semantic mappings. This approach effectively bridges the semantic gap between queries and documents. Empirical evaluation across multiple professional domains (e.g., law, medicine) and general-purpose benchmarks demonstrates an average 12.7% improvement in query rewriting accuracy and an 8.3% gain in downstream QA F1 score. Results confirm the critical role of domain prior knowledge in enhancing rewriting fidelity and retrieval effectiveness.

Technology Category

Application Category

📝 Abstract

A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.

Problem

Research questions and friction points this paper is trying to address.

Enhancing query rewriting in specialized domains with limited knowledge

Bridging the gap between user queries and document phrasings

Improving RAG-based QA systems for professional fields

Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual pre-training on professional documents

Combines with supervised fine-tuning

Enhances RAG-based QA in specialized domains

🔎 Similar Papers

No similar papers found.

Authors to Follow