ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited semantic modeling capability of text embedding models in reasoning-intensive document retrieval, this paper proposes ReasonEmbed, which combines two components: ReMixer and Redapter. ReMixer generates high-difficulty synthetic training data via controllable semantic perturbation and logical chain augmentation to mitigate data triviality. Redapter introduces a reasoning-intensity-aware dynamic sample weighting mechanism for adaptive optimization. The approach is compatible with backbone models of multiple sizes. On the BRIGHT benchmark, it achieves an nDCG@10 of 38.1, a new state of the art. Experiments indicate that the data synthesis method and the adaptive learning strategy improve the model's capacity to capture complex semantic relationships, such as multi-hop reasoning and implicit logic, between queries and documents.

📝 Abstract
In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling large-scale production of 82K high-quality training samples. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts each training sample's weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model achieves a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, significantly outperforming existing text embedding models. We will fully open-source the resources created for ReasonEmbed to advance research in this field.
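
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the paper or from the BRIGHT evaluation code. The function names and the toy relevance labels are hypothetical, and the sketch assumes the common convention that benchmark scores such as 38.1 are the per-query values averaged over all queries and multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    # Rank 0 is the top result, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: labels of the retrieved documents, in ranked order.
    all_relevances: labels of every relevant document for the query,
                    used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical query with two relevant documents, retrieved at ranks 1 and 3.
ranked = [1, 0, 1, 0, 0]
gold = [1, 1]
score = ndcg_at_k(ranked, gold)
print(f"nDCG@10 = {score:.4f}")
```

The ideal ranking is built from all relevant documents for the query rather than only those retrieved, so a system that misses relevant documents cannot reach a score of 1.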

Problem

Research questions and friction points this paper is trying to address.

Overcoming triviality in synthetic datasets for reasoning-intensive retrieval
Dynamically adjusting training weights based on reasoning intensity
Achieving superior performance on reasoning-intensive document retrieval tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReMixer synthesizes 82K high-quality, non-trivial training samples at scale
Redapter dynamically adjusts each training sample's weight based on its reasoning intensity
ReasonEmbed, implemented on backbones of varying sizes, achieves superior performance on reasoning-intensive retrieval
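
The idea of weighting training samples by reasoning intensity can be illustrated with a small sketch. This page does not describe how Redapter measures reasoning intensity or how it turns that measure into weights, so everything below is an assumption made for illustration: the InfoNCE contrastive loss, the temperature value, the sum-to-one weight normalization, and all function names are hypothetical and should not be read as the authors' algorithm.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

def weighted_batch_loss(samples, intensities, temperature=0.05):
    """Batch loss with each sample weighted by its (assumed) intensity score.

    samples: list of (positive_score, [negative_scores]) per query.
    intensities: non-negative reasoning-intensity scores, one per sample.
    """
    total = sum(intensities)
    if total == 0:
        weights = [1.0 / len(samples)] * len(samples)  # fall back to uniform
    else:
        weights = [r / total for r in intensities]  # assumed normalization
    return sum(w * info_nce(p, n, temperature) for w, (p, n) in zip(weights, samples))

# Hypothetical similarity scores: an easy sample and a hard sample.
batch = [(0.9, [0.1, 0.2]), (0.5, [0.45, 0.4])]
uniform = weighted_batch_loss(batch, [1.0, 1.0])
skewed = weighted_batch_loss(batch, [1.0, 3.0])  # hard sample weighted 3x
print(f"uniform={uniform:.4f} skewed={skewed:.4f}")
```

With equal intensities the loss reduces to the ordinary batch mean. Raising the intensity of the harder sample shifts the loss, and therefore the gradient, toward it.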
Authors

Jianlyu Chen
University of Science and Technology of China
Natural Language Processing, Information Retrieval
Junwei Lan
University of Science and Technology of China, Beijing Academy of Artificial Intelligence
Chaofan Li
Beijing University of Posts and Telecommunications
NLP
Defu Lian
University of Science and Technology of China, State Key Laboratory of Cognitive Intelligence
Zheng Liu
Beijing Academy of Artificial Intelligence, Hong Kong Polytechnic University