🤖 AI Summary
To address the dual challenges of inefficient large language model (LLM) inference and poor generalization of lightweight dense retrievers, this paper proposes DRAMA, an LLM-driven framework for training compact dense retrievers. Methodologically: (i) a pruned LLM serves as the backbone encoder, balancing representational capacity with computational efficiency; (ii) a diverse, LLM-based generative data augmentation strategy substantially improves generalization across multilingual and long-context scenarios; and (iii) a single-stage contrastive learning paradigm enables end-to-end retrieval optimization. Experiments demonstrate that the approach consistently outperforms conventional dense retrievers across multiple tasks and multilingual benchmarks. Crucially, it achieves strong generalization and high retrieval accuracy while maintaining low latency and a small parameter count, effectively unifying efficiency, scalability, and performance.
📝 Abstract
Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter counts bring significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller, generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These results highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
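The abstract mentions a single-stage contrastive learning setup but does not spell out the objective. A standard choice for dense-retriever training is an InfoNCE-style loss with in-batch negatives, where each query's paired document is the positive and the other documents in the batch act as negatives. The sketch below is illustrative of that general technique, not DRAMA's exact objective; the function name, temperature value, and embedding shapes are assumptions.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch-negative contrastive loss for dense retrieval (illustrative).

    q: (B, H) query embeddings; d: (B, H) document embeddings.
    d[i] is the positive for q[i]; all other rows of d serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = (q @ d.T) / temperature            # (B, B) scaled similarity matrix
    # Softmax cross-entropy with the diagonal as the target class
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

With random embeddings the loss sits near `log(B)`; when query and document embeddings align, it approaches zero. Real training would compute this over encoder outputs and backpropagate, which this NumPy sketch omits.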