🤖 AI Summary
To address the limited ability of text embedding models to capture semantics in reasoning-intensive document retrieval, this paper proposes a framework that combines ReMixer and Redapter. ReMixer generates difficult synthetic training data through controllable semantic perturbation and logical chain augmentation, mitigating the triviality of earlier synthetic data. Redapter adds a dynamic sample weighting mechanism that adapts each sample's weight to its reasoning intensity. The framework works with backbones of several sizes. On the BRIGHT benchmark it reaches an nDCG@10 of 38.1, a new state of the art. Experiments indicate that the data synthesis method and the adaptive learning strategy improve the model's ability to capture complex semantic relationships, such as multi-hop reasoning and implicit logic, and point toward a reasoning-oriented approach to embedding learning.
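For readers unfamiliar with the metric quoted above, the following is a minimal sketch of nDCG@10 for a single query. It assumes linear gain, a log2 rank discount, and that the input list holds the relevance labels of every judged document in ranked order; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the paper or from the BRIGHT evaluation code. Reported scores such as 38.1 would be this per-query value averaged over queries and scaled by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k labels (linear gain assumed)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top hit is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k for one query.

    Assumption: ranked_relevances lists the label of every judged document
    in the order the retriever ranked them, so sorting it gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    # One relevant document ranked second, one ranked fourth.
    print(round(ndcg_at_k([0, 1, 0, 1, 0]), 4))
```

A perfect ranking scores 1.0, and pushing relevant documents lower in the list reduces the score through the logarithmic discount.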
📝 Abstract
In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work makes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling large-scale production of 82K high-quality training samples. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts each training sample's weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model achieves a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, significantly outperforming existing text embedding models. We will fully open-source the resources created for ReasonEmbed to advance research in this field.
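The idea of weighting each training sample by its reasoning intensity can be illustrated with a small sketch. This is not the Redapter algorithm itself, whose details the abstract does not give. It assumes a standard InfoNCE contrastive loss, a precomputed non-negative intensity score per sample, and weights normalized to have mean 1. The names `info_nce`, `intensity_weights`, and `weighted_batch_loss`, and the temperature value, are hypothetical.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.02):
    """InfoNCE loss for one query: -log softmax of the positive over all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


def intensity_weights(intensities):
    """Turn non-negative intensity scores into weights with mean 1 (assumed scheme)."""
    n = len(intensities)
    total = sum(intensities)
    if total <= 0:
        return [1.0] * n  # fall back to uniform weighting
    return [n * x / total for x in intensities]


def weighted_batch_loss(batch, intensities, temperature=0.02):
    """Average per-sample InfoNCE losses, scaled by intensity-derived weights.

    batch: list of (positive_similarity, [negative_similarities]) pairs.
    """
    weights = intensity_weights(intensities)
    losses = [info_nce(pos, negs, temperature) for pos, negs in batch]
    return sum(w * l for w, l in zip(weights, losses)) / len(batch)


if __name__ == "__main__":
    # The first sample is easy; the second has negatives close to the positive.
    batch = [(0.9, [0.1, 0.2]), (0.5, [0.45, 0.4])]
    print(weighted_batch_loss(batch, [1.0, 1.0]))  # uniform weighting
    print(weighted_batch_loss(batch, [1.0, 3.0]))  # emphasize the harder sample
```

With uniform intensities this reduces to the ordinary batch-mean loss. Raising the intensity of the harder sample increases its share of the total, which is the general effect the abstract attributes to adaptive weighting.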