Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

📅 2025-11-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Text embedding models exhibit a systematic mean bias: their outputs decompose into a sentence-invariant bias component and an effective semantic component. Method: a plug-and-play, training-free renormalization technique, grounded in theoretical analysis and empirical validation, showing that subtracting the vector projection onto the bias direction outperforms direct mean subtraction for bias correction. The method requires only post-hoc processing of pretrained embeddings. Contribution/Results: Evaluated on 38 mainstream multilingual embedding models, it improves retrieval performance on the MMTEB benchmark by 9.7σ, classification by 3.1σ, and other tasks by an average of 0.8σ. This work is the first to uncover the structural commonality of embedding bias across models and establishes an interpretable, general-purpose, training-free paradigm for improving embedding quality and semantic fidelity.

πŸ“ Abstract
We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7$\sigma$ on retrieval tasks, 3.1$\sigma$ on classification tasks, and 0.8$\sigma$ on other types of tasks. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
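The two renormalization variants from the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the bias $\mu$ is estimated as the mean of a batch of embeddings, and the function names (`renormalize_subtract`, `renormalize_project`) are invented for this sketch.

```python
import numpy as np

def estimate_bias(embeddings: np.ndarray) -> np.ndarray:
    """Estimate the shared bias mu as the mean embedding.

    Assumption for illustration: mu is approximated by the corpus mean,
    since the abstract says mu is almost identical across sentences.
    """
    return embeddings.mean(axis=0)

def renormalize_subtract(e: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 1: directly subtract mu from e, then rescale to unit norm."""
    v = e - mu
    return v / np.linalg.norm(v)

def renormalize_project(e: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 2: subtract the projection of e onto mu (the variant the
    paper predicts, and finds, to perform better)."""
    mu_hat = mu / np.linalg.norm(mu)
    v = e - (e @ mu_hat) * mu_hat  # remove the component along mu
    return v / np.linalg.norm(v)

# Toy demo: random "embeddings" with an injected common bias.
rng = np.random.default_rng(0)
mu_true = rng.normal(size=8)
E = rng.normal(size=(100, 8)) + mu_true

mu = estimate_bias(E)
e = E[0]
corrected = renormalize_project(e, mu)
# The projection variant leaves the corrected vector orthogonal to mu.
print(corrected @ (mu / np.linalg.norm(mu)))
```

Note the design difference: variant 1 shifts every embedding by the same vector, while variant 2 removes only the component of each embedding that lies along the bias direction, leaving all other directions untouched.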
Problem

Research questions and friction points this paper is trying to address.

Correcting consistent mean bias in text embedding models
Improving performance on multilingual embedding benchmarks
Providing training-free bias removal through renormalization techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Renormalization corrects embedding bias without training
Subtracts projection of vector onto bias direction
Plug-and-play solution improves MMTEB benchmark performance
Xingyu Ren
Ph.D. graduate, Shanghai Jiao Tong University
Face Modeling, Generative AI
Youran Sun
Dept. of Mathematical Sciences, Tsinghua University, Beijing, China
Haoyu Liang
Dept. of Computer Science and Technology, Tsinghua University, Beijing, China