Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training

📅 2025-07-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In professional-domain RAG-based question-answering systems, retrieval failure often arises from semantic misalignment between user queries and domain documents. To address this, we propose a query rewriting method based on domain-adaptive continual pretraining. Our core innovation is a “pre-exam review” mechanism: the rewriting model undergoes knowledge-injection continual pretraining on domain-specific documents—followed by supervised fine-tuning—to internalize domain terminology, syntactic patterns, and cross-modal semantic mappings. This approach effectively bridges the semantic gap between queries and documents. Empirical evaluation across multiple professional domains (e.g., law, medicine) and general-purpose benchmarks demonstrates an average 12.7% improvement in query rewriting accuracy and an 8.3% gain in downstream QA F1 score. Results confirm the critical role of domain prior knowledge in enhancing rewriting fidelity and retrieval effectiveness.

Technology Category

Application Category

📝 Abstract
A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.
Problem

Research questions and friction points this paper is trying to address.

Enhancing query rewriting in specialized domains with limited knowledge
Bridging the gap between user queries and document phrasings
Improving RAG-based QA systems for professional fields
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continual pre-training on professional documents
Combines with supervised fine-tuning
Enhances RAG-based QA in specialized domains
🔎 Similar Papers
No similar papers found.
Q
Qi Wang
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; Peng Cheng Laboratory, Shenzhen 518066, China; University of Chinese Academy of Sciences, Beijing 100049, China
Yixuan Cao
Yixuan Cao
Shenzhen University
Software EngineeringSecurityKernel & CompilerTesting & VerificationBig Data
Y
Yifan Liu
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
J
Jiangtao Zhao
China Merchants Securities Co., Ltd, Shenzhen 518046, China
Ping Luo
Ping Luo
National University of Defense Technology
distributed_computing