LMAR: Language Model Augmented Retriever for Domain-specific Knowledge Indexing

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in domain-specific RAG—including degraded performance of pretrained embeddings, high computational overhead of LLM-based retrievers, and the trade-off between high-quality fine-tuning data and optimal text chunking—this paper proposes a model-agnostic, two-stage retrieval-augmented framework. In Stage I, an LLM serves as both annotator and validator to generate contextually complete and semantically reliable triplet supervision data. In Stage II, contrastive learning fine-tunes embedding models, while lightweight text clustering optimizes retrieval structure. The framework is architecture-agnostic, enabling plug-and-play integration and continuous evolution. Evaluated across multiple domain-specific benchmarks, it significantly outperforms mainstream baselines, achieving strong scalability, low latency, and efficiency under moderate hardware requirements—establishing a new paradigm for robust, production-ready domain RAG deployment.

📝 Abstract
Retrieval Augmented Generation (RAG) systems often struggle with domain-specific knowledge due to performance deterioration of pre-trained embeddings and the prohibitive computational cost of large language model (LLM)-based retrievers. While fine-tuning embedding models on augmented data offers a promising direction, its effectiveness is limited by the need for high-quality training data and reliable chunking strategies that preserve contextual integrity. We propose LMAR (Language Model Augmented Retriever), a model-agnostic framework that addresses these challenges by combining LLM-guided data synthesis with contrastive embedding adaptation and efficient text clustering. LMAR consists of a two-stage pipeline: (1) triplet sampling and synthetic data augmentation, where LLMs act as both labeler and validator to ensure high-fidelity supervision throughout the pipeline, and (2) contrastive fine-tuning of embedding models paired with lightweight text clustering to optimize the retrieval structure. Experimental results across multiple domain-specific benchmark datasets demonstrate that LMAR outperforms multiple baseline models while maintaining moderate hardware requirements and low latency. Its model-agnostic nature further enables seamless integration with emerging RAG architectures and text embedding models, ensuring continual improvement without redesigning the pipeline. These results highlight LMAR as a practical and cost-effective solution for scalable domain-specific adaptation.
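To illustrate the contrastive embedding adaptation described above, a minimal triplet-margin objective over chunk embeddings can be sketched as follows. This is a generic sketch, not the paper's exact formulation; the function name, margin value, and toy embeddings are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet margin loss on embedding vectors (illustrative,
    not LMAR's exact objective): pull the anchor toward the positive
    chunk and push it away from the negative chunk in cosine space."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Loss reaches zero once the positive is closer than the negative
    # by at least `margin`.
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

# Toy triplet: a query-like anchor, a near-duplicate positive chunk,
# and an unrelated negative chunk (random stand-ins for real embeddings).
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.05 * rng.normal(size=8)
negative = rng.normal(size=8)
loss = triplet_loss(anchor, positive, negative)
```

In a real Stage II run, triplets would come from the LLM-validated supervision data of Stage I, and the loss would be backpropagated through the embedding model rather than computed on fixed vectors.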
Problem

Research questions and friction points this paper is trying to address.

Improves domain-specific retrieval in RAG systems
Addresses embedding performance and computational cost issues
Enhances data quality and chunking strategies for adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided data synthesis for high-fidelity supervision
Contrastive embedding adaptation for domain-specific performance
Efficient text clustering to preserve contextual integrity
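The text-clustering idea in the bullets above can be illustrated with a plain k-means pass over chunk embeddings. The paper's exact clustering algorithm and parameters are not given here, so this is a hedged sketch under the assumption that semantically related chunks sit close together in embedding space.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means over chunk embeddings; returns a cluster label per
    chunk. Illustrative only -- LMAR's clustering method may differ."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned members.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy corpus: two well-separated groups of 2-d "chunk embeddings".
X = np.vstack([np.zeros((5, 2)), 10.0 * np.ones((5, 2))])
labels = kmeans(X, k=2)
```

Grouping chunks this way lets retrieval first narrow to a relevant cluster before ranking individual chunks, which is one common route to the low-latency behavior the summary reports.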
👥 Authors
Yao Zhao
Department of Applied Mathematics and Statistics, Johns Hopkins University
Yantian Ding
Department of Applied Mathematics and Statistics, Johns Hopkins University
Zhiyue Zhang
Department of Applied Mathematics and Statistics, Johns Hopkins University
Dapeng Yao
Department of Applied Mathematics and Statistics, Johns Hopkins University
Yanxun Xu
Johns Hopkins University