Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

📅 2026-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of efficiently adapting general-purpose large language models into high-performing biomedical retrievers without extensive fine-tuning or compromising their general capabilities. To this end, the authors propose the Synthesize-Train-Merge (STM) framework, which enables modular domain adaptation through three key components: synthesizing hard negative samples, optimizing retrieval-oriented prompts, and merging task-specific models. Notably, STM avoids the need for large-scale pretraining. Evaluated across 12 medical and general retrieval tasks from the MTEB benchmark, the approach improves task-specific performance by up to 23.5% (averaging +7.5%) over strong baselines. The merged model consistently outperforms both single-task experts and robust baselines, effectively balancing domain-specific effectiveness with broad generalization.
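The hard-negative component can be illustrated with a minimal sketch. The summary above says the negatives are *synthesized*, and the exact procedure is not described here, so the following shows only the standard baseline idea of hard negatives — passages that score close to the query yet are not relevant. The function name `mine_hard_negatives` and the data layout are hypothetical, not from the paper.

```python
def mine_hard_negatives(query_vec, corpus, positives, k=2):
    """Rank corpus passages by dot-product similarity to the query and
    return the highest-ranked passages that are NOT known positives.
    These near-miss passages are "hard" negatives for retriever training."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(corpus, key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked if d["id"] not in positives][:k]

# Toy example: p2 is semantically close to the query but not a positive,
# so it is selected before the clearly unrelated p3.
corpus = [
    {"id": "p1", "vec": [1.0, 0.0]},  # the known relevant passage
    {"id": "p2", "vec": [0.9, 0.1]},  # near-miss -> hard negative
    {"id": "p3", "vec": [0.0, 1.0]},  # easy negative
]
print(mine_hard_negatives([1.0, 0.0], corpus, positives={"p1"}, k=1))
```

In contrastive retriever training, such negatives make the loss far more informative than randomly sampled passages, which is why both mined and LLM-synthesized hard negatives are common.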

📝 Abstract
Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine, remains underexplored. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5% (average 7.5%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.
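The merging step of STM can be sketched in a few lines. The abstract does not specify the merging algorithm, so this assumes the simplest common baseline — uniform weighted averaging of the experts' parameters; the `merge_experts` helper and the list-of-floats parameter layout are illustrative assumptions, not the paper's implementation.

```python
def merge_experts(expert_state_dicts, weights=None):
    """Merge task-specific expert checkpoints by weighted parameter
    averaging (uniform by default). Each "state dict" maps parameter
    names to flat lists of floats, standing in for real tensors."""
    n = len(expert_state_dicts)
    if weights is None:
        weights = [1.0 / n] * n  # uniform average across experts
    merged = {}
    for name in expert_state_dicts[0]:
        length = len(expert_state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, expert_state_dicts))
            for i in range(length)
        ]
    return merged

# Two toy "experts" sharing one parameter tensor.
expert_a = {"layer.weight": [1.0, 2.0]}
expert_b = {"layer.weight": [3.0, 4.0]}
print(merge_experts([expert_a, expert_b]))  # → {'layer.weight': [2.0, 3.0]}
```

Averaging in weight space is attractive here because it needs no extra training data: the merged model can retain each expert's domain skill while smoothing out overfitting to any single task, which matches the paper's finding that the merged model beats the single-task experts.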
Problem

Research questions and friction points this paper is trying to address.

LLM adaptation
biomedical retrievers
domain-specific retrieval
retrieval-augmented generation
model specialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic hard negatives
retrieval prompt optimization
model merging
domain-specialized retrievers
LLM adaptation
Sameh Khattab
IKIM, University Hospital Essen, Germany
Jean-Philippe Corbeil
Microsoft
natural language processing, deep learning, machine learning
Osman Alperen Koracs
IKIM, University Hospital Essen, Germany
Amin Dada
Institute for AI in Medicine (IKIM), University Hospital Essen
Julian Friedrich
IKIM, University Hospital Essen, Germany
Francois Beaulieu
Microsoft Healthcare & Life Sciences
Paul Vozila
Microsoft Healthcare & Life Sciences
J. Kleesiek
IKIM, University Hospital Essen, Germany