🤖 AI Summary
Traditional contrastive learning (e.g., InfoNCE) often degrades the performance of state-of-the-art dense retrieval models during corpus-specific fine-tuning. This paper identifies this issue and addresses it with a cross-encoder listwise knowledge distillation framework. Rather than relying on human-authored queries alone, it leverages diverse synthetic queries (claims, keywords, and questions) generated by large language models. By operating at the level of the ranking list rather than on individual contrastive pairs, the method avoids the instability of pairwise and pointwise contrastive objectives and preserves the teacher model's holistic ranking structure. Evaluated on multiple standard benchmarks, the distilled model achieves state-of-the-art retrieval effectiveness among BERT-based embedding models. The code and trained models are publicly released.
📝 Abstract
We investigate improving the retrieval effectiveness of embedding models through the lens of corpus-specific fine-tuning. Prior work has shown that fine-tuning with queries generated from a dataset's retrieval corpus can boost retrieval effectiveness on that dataset. Surprisingly, however, we find that fine-tuning with the conventional InfoNCE contrastive loss often reduces effectiveness in state-of-the-art models. To overcome this, we revisit cross-encoder listwise distillation and demonstrate that, unlike contrastive learning alone, listwise distillation improves retrieval effectiveness more consistently across multiple datasets. Additionally, we show that synthesizing training data with diverse query types (such as claims, keywords, and questions) yields greater effectiveness than any single query type alone, regardless of the query type used in evaluation. Our findings further indicate that synthetic queries offer utility comparable to human-written queries for training. Using our approach, we train an embedding model that achieves state-of-the-art effectiveness among BERT embedding models. We release our model, along with our query generation and training code, to facilitate further research.
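To make the contrast between the two objectives concrete, here is a minimal, framework-free sketch of the per-query losses being compared: InfoNCE scores a single positive against negatives, while listwise distillation matches the student's score distribution over a whole ranked list to the cross-encoder teacher's. The function names and the exact formulation (softmax-normalized scores, KL divergence) are illustrative assumptions, not the paper's actual implementation.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over a list of relevance scores.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def infonce_loss(pos_score, neg_scores):
    # Contrastive objective: negative log-probability of the single
    # positive passage against the sampled negatives.
    probs = softmax([pos_score] + neg_scores)
    return -math.log(probs[0])

def listwise_distillation_loss(teacher_scores, student_scores):
    # Listwise objective: KL(teacher || student) between softmax-normalized
    # score distributions over the SAME ranked list of passages, so the
    # student learns the teacher's full ranking structure, not just
    # a positive-vs-negative decision.
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Note that the distillation loss is zero whenever the student reproduces the teacher's relative score gaps, whereas InfoNCE always pushes the positive's score upward regardless of how well calibrated the list already is.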