Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of traditional approximate nearest neighbour (ANN) retrieval in recommender systems: the reliance on point-estimate embeddings that ignore inherent uncertainty, which leads to popularity bias and underexposure of long-tail items. The authors propose DINOSAUR, a framework that incorporates embedding uncertainty directly into the ANN retrieval pipeline. Item embeddings are sampled stochastically at indexing time and a user embedding is sampled at query time. This two-sided randomisation implicitly marginalises over uncertainty, enabling uncertainty-aware candidate generation without changes to the underlying model or index structure. Experiments show large gains in long-tail item coverage at the cost of a small reduction in offline recall.

📝 Abstract

Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to "relevance"; treating them as point estimates ignores the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item $i$ and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
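
The two-sided sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: embeddings are modelled as diagonal Gaussians (a mean and a per-dimension standard deviation), relevance is scored by inner product, and an exhaustive scan stands in for a real ANN index. The function names `sample_gaussian`, `build_sampled_index`, and `retrieve` are hypothetical.

```python
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian N(mean, diag(std^2)) (assumed form)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def build_sampled_index(item_means, item_stds, samples_per_item, rng):
    """Return a flat list of (item_id, sampled_vector) entries, S_i entries for item i."""
    index = []
    for item_id, (mean, std) in enumerate(zip(item_means, item_stds)):
        for _ in range(samples_per_item[item_id]):
            index.append((item_id, sample_gaussian(mean, std, rng)))
    return index


def retrieve(user_mean, user_std, index, k, rng):
    """Sample one user embedding and return the top-k distinct items.

    Scoring by inner product and scanning every entry are assumptions;
    a production system would query an ANN index instead.
    """
    query = sample_gaussian(user_mean, user_std, rng)
    scored = sorted(
        index,
        key=lambda entry: sum(q * v for q, v in zip(query, entry[1])),
        reverse=True,
    )
    result = []
    for item_id, _ in scored:
        # Several sampled vectors can belong to one item; keep the best hit only.
        if item_id not in result:
            result.append(item_id)
            if len(result) == k:
                break
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_items = 4, 20
    item_means = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
    # Hypothetical uncertainty: later (tail) items get larger standard deviations.
    item_stds = [[0.05 + 0.02 * i] * dim for i in range(n_items)]
    user_mean = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    index = build_sampled_index(item_means, item_stds, [3] * n_items, rng)
    print(retrieve(user_mean, [0.1] * dim, index, k=5, rng=rng))
```

Setting every standard deviation to zero makes each sampled vector equal its mean, so the sketch reduces to ordinary point-estimate top-k retrieval, consistent with the vanishing-uncertainty limit stated in the abstract. Because the index stores plain vectors, the same entries could in principle be loaded into an off-the-shelf ANN library without modification.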

Problem

Research questions and friction points this paper is trying to address.

Approximate Nearest Neighbour Search

Embedding Uncertainty

Uncertainty-Aware Retrieval

Long-Tail Coverage

Recommender Systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional Embeddings

Uncertainty-Aware Retrieval

Approximate Nearest Neighbour Search