🤖 AI Summary
This work addresses positional bias in dense retrieval models, whose retrieval effectiveness degrades when relevant content appears later in a passage. The authors adapt a training-free, inference-time attention calibration mechanism to retrieval and extend it with an adjustable strength coefficient λ that interpolates between the original and fully calibrated attention distributions. They study how basket size, the set of calibrated layers, and λ trade off positional fairness against retrieval effectiveness, and find that partial calibration frequently outperforms full calibration. A single default configuration (B=128, λ=0.5, 50% layer depth) applies to both <s>-pooled and last-token-pooled architectures and improves the harmonic mean of nDCG@10 across position groups on FineWeb-PosQ for all three evaluated models without per-model tuning. The same configuration transfers unchanged to the multilingual, multidomain PosIR benchmark (10 languages, 31 domains), where it reduces the Position Sensitivity Index in all 16 evaluated combinations while preserving or improving aggregate nDCG@10.
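As a rough illustration of what a strength coefficient λ does, the sketch below blends an original attention distribution with a calibrated one. It assumes a plain linear (convex) interpolation over a single attention row; the function name `interpolate_attention` and the toy weights are hypothetical and are not taken from the authors' codebase, which may blend the distributions differently.

```python
def interpolate_attention(original, calibrated, lam):
    """Blend two attention distributions defined over the same tokens.

    lam = 0.0 returns the original weights, lam = 1.0 the fully calibrated
    ones. Linear interpolation is an assumption of this sketch.
    """
    if len(original) != len(calibrated):
        raise ValueError("distributions must cover the same tokens")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * o + lam * c for o, c in zip(original, calibrated)]


# Toy values only: weights skewed toward early tokens, and a stand-in
# for what a fully calibrated distribution might look like.
original = [0.6, 0.3, 0.1]
calibrated = [0.2, 0.3, 0.5]

blended = interpolate_attention(original, calibrated, lam=0.5)
print(blended)
```

Because the result is a convex combination of two distributions, it still sums to 1, so no renormalization is needed under this assumption.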
📝 Abstract
Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient λ that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. A single configuration (B=128, λ=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile × model × retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers
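To see why a harmonic mean of nDCG@10 across positional groups rewards positional fairness, consider the minimal sketch below. The group names and scores are made-up illustrative values, not results from the paper, and `positional_harmonic_mean` is a hypothetical helper.

```python
from statistics import harmonic_mean, mean


def positional_harmonic_mean(ndcg_by_group):
    """Harmonic mean of per-group nDCG@10 scores.

    Unlike the arithmetic mean, it is pulled toward the weakest group,
    so a model that neglects late-position content is penalized.
    """
    return harmonic_mean(ndcg_by_group.values())


# Illustrative numbers only (not from the paper): both profiles share the
# same arithmetic mean, but one is uneven across positions.
uneven = {"early": 0.80, "middle": 0.60, "late": 0.40}
balanced = {"early": 0.60, "middle": 0.60, "late": 0.60}

hm_uneven = positional_harmonic_mean(uneven)
hm_balanced = positional_harmonic_mean(balanced)

print(f"uneven:   mean={mean(uneven.values()):.3f}  harmonic={hm_uneven:.3f}")
print(f"balanced: mean={mean(balanced.values()):.3f}  harmonic={hm_balanced:.3f}")
```

Both profiles have the same arithmetic mean, yet the uneven one scores lower under the harmonic mean, which is the property that makes the metric sensitive to positional bias.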