Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

📅 2026-06-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a limitation of approximate nearest neighbour (ANN) retrieval in recommender systems: reliance on point-estimate embeddings that ignore their own uncertainty, which biases retrieval toward popular items and underexposes the long tail. The authors propose DINOSAUR, a framework that incorporates embedding uncertainty into the ANN retrieval pipeline. It samples several embeddings per item when building the index and samples a user embedding at query time. This two-sided randomisation implicitly marginalises over embedding uncertainty, enabling uncertainty-aware candidate generation without changes to the model architecture or the ANN index infrastructure. Experiments show that DINOSAUR improves coverage of long-tail items while incurring only a small loss in offline recall.
๐Ÿ“ Abstract
Approximate Nearest Neighbour (ANN) search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point-estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to "relevance", yet point estimates ignore this inherent uncertainty. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
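The two-sided sampling described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: isotropic Gaussian noise around each mean embedding, inner-product scoring, and exact brute-force search standing in for a real ANN index. The function names (`build_sampled_index`, `retrieve`) and parameters are hypothetical.

```python
import numpy as np


def build_sampled_index(item_means, item_stds, samples_per_item, rng):
    """Draw samples_per_item[i] noisy copies of item i (assumed Gaussian noise).

    Returns the stacked sample vectors and, for each row, the id of the
    item that row was sampled from.
    """
    owner = np.repeat(np.arange(len(item_means)), samples_per_item)
    noise = rng.standard_normal((len(owner), item_means.shape[1]))
    vectors = item_means[owner] + item_stds[owner, None] * noise
    return vectors, owner


def retrieve(user_mean, user_std, vectors, owner, k, rng):
    """Sample one user embedding and return up to k distinct item ids."""
    query = user_mean + user_std * rng.standard_normal(user_mean.shape)
    # Exact inner-product search; a real system would query an ANN index here.
    scores = vectors @ query
    ranked_items = owner[np.argsort(-scores)]
    # Several samples can map to the same item: keep each item's best rank.
    _, first_pos = np.unique(ranked_items, return_index=True)
    return ranked_items[np.sort(first_pos)][:k]
```

In this sketch, setting every standard deviation to zero makes each sample equal its mean, so retrieval reduces to ordinary point-estimate top-k, which matches the limiting behaviour the abstract claims. Giving an item a larger standard deviation spreads its samples over a wider region, so more sampled queries can reach it.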
Problem

Research questions and friction points this paper is trying to address.

Approximate Nearest Neighbour Search
Embedding Uncertainty
Uncertainty-Aware Retrieval
Long-Tail Coverage
Recommender Systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional Embeddings
Uncertainty-Aware Retrieval
Approximate Nearest Neighbour Search
Long-Tail Coverage
Stochastic Retrieval