🤖 AI Summary
In professional-domain RAG-based question-answering systems, retrieval failure often arises from semantic misalignment between user queries and domain documents. To address this, we propose a query rewriting method based on domain-adaptive continual pretraining. Our core innovation is a “pre-exam review” mechanism: the rewriting model undergoes knowledge-injection continual pretraining on domain-specific documents—followed by supervised fine-tuning—to internalize domain terminology, syntactic patterns, and cross-modal semantic mappings. This approach effectively bridges the semantic gap between queries and documents. Empirical evaluation across multiple professional domains (e.g., law, medicine) and general-purpose benchmarks demonstrates an average 12.7% improvement in query rewriting accuracy and an 8.3% gain in downstream QA F1 score. Results confirm the critical role of domain prior knowledge in enhancing rewriting fidelity and retrieval effectiveness.
📝 Abstract
A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.