SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity, high construction cost, and domain adaptation challenges of supervised fine-tuning (SFT) data in specialized domains, this paper proposes a retrieval-augmented method for automatic instruction-data generation. Starting from a small set of human-authored domain-specific questions as seeds, the method leverages large language models (LLMs) for semantically consistent question expansion and dynamically retrieves relevant domain documents to generate factually grounded, precise responses—yielding high-quality, broad-coverage instruction-response pairs. Its core innovation lies in the tight integration of lightweight human curation, LLM-based generation, and real-time retrieval, thereby jointly ensuring data diversity, factual accuracy, and domain expertise. Experiments demonstrate that the generated data significantly improves SFT performance across multiple professional downstream tasks. The project—including all generated data and source code—is publicly released to support reproducible research and practical deployment.

📝 Abstract
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)
Problem

Research questions and friction points this paper is trying to address.

Creating domain-specific instruction datasets for fine-tuning
Overcoming data scarcity and domain constraints in SFT
Enhancing LLM performance with retrieval-based dataset generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands seed questions using a large language model
Retrieves domain resources to ground accurate answers
Generates high-quality instruction datasets for SFT
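The pipeline described above (seed questions → LLM-based expansion → dynamic retrieval → grounded answer generation) can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: `expand_questions`, `retrieve`, and `generate_answer` are illustrative stand-ins, with trivial paraphrasing in place of an LLM and word-overlap ranking in place of a real retriever.

```python
# Minimal sketch of a SearchInstruct-style pipeline. All function names and
# logic here are illustrative assumptions, not the paper's released code.

def expand_questions(seed, n=2):
    # Stand-in for LLM-based semantic expansion of a seed question;
    # a real system would prompt an LLM for paraphrases and follow-ups.
    return [seed] + [f"{seed} (variant {i})" for i in range(1, n)]

def retrieve(question, corpus, k=1):
    # Stand-in for dynamic retrieval: rank documents by simple word overlap
    # (a real system would use BM25 or dense embeddings).
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def generate_answer(question, docs):
    # Stand-in for LLM answer generation grounded in the retrieved context.
    return f"Answer to '{question}' based on: {docs[0]}"

def search_instruct(seeds, corpus):
    # Expand each seed, retrieve supporting documents per expanded question,
    # and emit grounded instruction-response pairs.
    pairs = []
    for seed in seeds:
        for q in expand_questions(seed):
            docs = retrieve(q, corpus)
            pairs.append({"instruction": q, "response": generate_answer(q, docs)})
    return pairs

corpus = ["Tehran is the capital of Iran.", "Persian is spoken in Iran."]
pairs = search_instruct(["What is the capital of Iran?"], corpus)
print(len(pairs))  # prints 2: the original seed plus one variant
```

Replacing the stubs with real LLM calls and a retriever is where the paper's contributions (question expansion quality and factual grounding) actually live; the skeleton only shows how the stages compose.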
Iman Barati
Iran University of Science and Technology
Mostafa Amiri
University of Tehran
Heshaam Faili
Full Professor, University of Tehran
Natural Language Processing, Social Networks