ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the limited ability of text embedding models to capture semantics in reasoning-intensive document retrieval, this paper proposes a framework that combines ReMixer and Redapter. ReMixer generates difficult synthetic training data via controllable semantic perturbation and logical chain augmentation, mitigating the triviality of earlier synthetic data; Redapter weights each training sample dynamically according to its reasoning intensity. The framework is compatible with backbones of multiple sizes. On the BRIGHT benchmark, ReasonEmbed-Qwen3-8B reaches an nDCG@10 of 38.1, a new state of the art. Experiments indicate that the data synthesis method and the adaptive learning strategy improve the model's ability to capture complex semantic relationships, such as multi-hop reasoning and implicit logic, and point toward a reasoning-oriented approach to embedding learning.
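For readers unfamiliar with the metric behind the BRIGHT result, the following is a minimal sketch of nDCG@10 using the standard linear-gain formulation. The function names and the example relevance labels are illustrative assumptions, not taken from the paper or from the BRIGHT evaluation code; scores such as 38.1 are presumably reported on a 0 to 100 scale, that is, this value multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query (hypothetical input),
                    used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Illustrative query with two relevant documents, retrieved at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], all_relevances=[1, 1])
perfect = ndcg_at_k([1, 1, 0, 0], all_relevances=[1, 1])
```

A benchmark-level figure is typically the mean of this per-query score over all queries.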

📝 Abstract

In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling large-scale production of 82K high-quality training samples. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts each training sample's weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model achieves a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, significantly outperforming existing text embedding models. We will fully open-source the resources created for ReasonEmbed to advance research in this field.
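The idea of weighting each training sample by its reasoning intensity can be illustrated with a weighted contrastive loss. This is only a sketch under stated assumptions: the abstract does not specify how Redapter measures reasoning intensity or how the weights enter the objective, so the InfoNCE loss, the `intensity` values, the temperature, and all function names below are hypothetical stand-ins rather than the authors' algorithm.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.05):
    """Standard InfoNCE loss for one query: -log softmax of the positive score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

def intensity_weighted_loss(batch, temperature=0.05):
    """Weighted mean of per-sample losses.

    batch: list of (pos_score, neg_scores, intensity) tuples, where intensity is a
    hypothetical non-negative reasoning-intensity weight (assumption, not from the paper).
    """
    total_weight = sum(w for _, _, w in batch)
    weighted = sum(w * info_nce(p, n, temperature) for p, n, w in batch)
    return weighted / total_weight

# Illustrative similarity scores: an easy sample and a harder one with close negatives.
easy = (0.9, [0.1, 0.2])
hard = (0.5, [0.45, 0.4])

uniform = intensity_weighted_loss([(*easy, 1.0), (*hard, 1.0)])
reweighted = intensity_weighted_loss([(*easy, 1.0), (*hard, 3.0)])
```

With a larger weight on the harder sample, the batch loss is dominated by that sample, so its gradient contributes more to the update than it would under uniform weighting.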

Problem

Research questions and friction points this paper is trying to address.

Overcoming triviality in synthetic datasets for reasoning-intensive retrieval

Dynamically adjusting training weights based on reasoning intensity

Achieving superior performance on reasoning-intensive document retrieval tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMixer synthesizes high-quality training data at scale

Redapter algorithm dynamically adjusts sample weights based on reasoning intensity

ReasonEmbed models across multiple backbone sizes achieve superior reasoning-intensive retrieval performance

🔎 Similar Papers

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval