LLMs as Sparse Retrievers: A Framework for First-Stage Product Search

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In product search, sparse retrieval suffers from vocabulary mismatch, while directly applying large language models (LLMs) faces two key challenges: (1) hallucination on short queries and titles, e.g., spurious term expansion and attenuation of critical literal tokens such as brand names and model numbers; and (2) difficulty learning sparse representations over the LLM's high-dimensional vocabulary. To address these, we propose PROSPER, a novel sparse retrieval framework that (1) introduces a literal residual network to explicitly preserve original keyword weights, mitigating term under-weighting; and (2) incorporates a lexical focusing window mechanism enabling coarse-to-fine two-stage sparsification during training. Extensive offline experiments show PROSPER significantly outperforms conventional sparse baselines and achieves recall competitive with state-of-the-art dense retrievers; online A/B testing demonstrates substantial commercial revenue gains. PROSPER establishes a practical, LLM-augmented paradigm for first-stage sparse retrieval in production search systems.


πŸ“ Abstract
Product search is a crucial component of modern e-commerce platforms, with billions of user queries every day. In product search systems, first-stage retrieval should achieve high recall while ensuring efficient online deployment. Sparse retrieval is particularly attractive in this context due to its interpretability and storage efficiency. However, sparse retrieval methods suffer from severe vocabulary mismatch issues, leading to suboptimal performance in product search scenarios. With their potential for semantic analysis, large language models (LLMs) offer a promising avenue for mitigating vocabulary mismatch issues and thereby improving retrieval quality. Directly applying LLMs to sparse retrieval in product search exposes two key challenges: (1) queries and product titles are typically short and highly susceptible to LLM-induced hallucinations, such as generating irrelevant expansion terms or underweighting critical literal terms like brand names and model numbers; (2) the large vocabulary space of LLMs makes training difficult to initialize effectively, hindering the learning of meaningful sparse representations in such ultra-high-dimensional spaces. To address these challenges, we propose PROSPER, a framework for PROduct search leveraging LLMs as SParsE Retrievers. PROSPER incorporates: (1) a literal residual network that alleviates hallucination in lexical expansion by reinforcing underweighted literal terms through a residual compensation mechanism; and (2) a lexical focusing window that facilitates effective training initialization via a coarse-to-fine sparsification strategy. Extensive offline and online experiments show that PROSPER significantly outperforms sparse baselines and achieves recall performance comparable to advanced dense retrievers, while also achieving revenue increments online.
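The residual compensation idea behind the literal residual network can be sketched roughly as follows. This is an illustrative assumption about the mechanism, not the paper's actual implementation; the function name, the additive residual, and the example weights are all hypothetical:

```python
# Hypothetical sketch: an LLM assigns sparse term weights, and a residual
# path reinforces tokens that appear literally in the query or title, so
# brand names and model numbers are not attenuated by expansion.
def literal_residual_weights(llm_weights: dict[str, float],
                             literal_tokens: list[str],
                             residual_scale: float = 1.0) -> dict[str, float]:
    """Add a residual boost for tokens that literally occur in the text."""
    weights = dict(llm_weights)
    for tok in literal_tokens:
        # Residual compensation: a literal token keeps a base weight on top
        # of its (possibly under-weighted) LLM-assigned score.
        weights[tok] = weights.get(tok, 0.0) + residual_scale
    return weights

# Example: the LLM under-weights the model number "x100v".
llm_weights = {"camera": 1.2, "fujifilm": 0.8, "x100v": 0.1, "photo": 0.5}
boosted = literal_residual_weights(llm_weights, ["fujifilm", "x100v"])
```

After the residual, expansion terms such as "photo" keep their LLM weight, while literal tokens like "x100v" are guaranteed a floor that survives into the sparse index.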
Problem

Research questions and friction points this paper is trying to address.

Addresses vocabulary mismatch in sparse retrieval for product search
Mitigates LLM hallucinations in short queries and product titles
Solves training challenges in ultra-high-dimensional sparse representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Literal residual network reinforces underweighted literal terms, curbing LLM hallucination
Lexical focusing window enables coarse-to-fine sparsification
PROSPER framework combines sparse retrieval with LLMs
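One plausible reading of the coarse-to-fine lexical focusing window is a top-k mask over the vocabulary that shrinks as training proceeds. The sketch below is an assumption for illustration; the linear annealing schedule, window sizes, and function names are not taken from the paper:

```python
# Hypothetical sketch of a "lexical focusing window": keep only the top-k
# vocabulary dimensions of a sparse representation, annealing k from a
# coarse window (easy to initialize) to a fine one (truly sparse).
def focusing_window_size(step: int, total_steps: int,
                         k_start: int = 4096, k_end: int = 128) -> int:
    """Linearly anneal the top-k window from coarse to fine over training."""
    frac = min(step / total_steps, 1.0)
    return int(k_start + (k_end - k_start) * frac)

def apply_window(weights: list[float], k: int) -> list[float]:
    """Zero out all but the k largest weights."""
    if k >= len(weights):
        return list(weights)
    threshold = sorted(weights, reverse=True)[k - 1]
    return [w if w >= threshold else 0.0 for w in weights]
```

Early in training the wide window lets gradients reach many vocabulary dimensions; as it narrows, the model is forced toward the sparse representations needed for an inverted index.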
🔎 Similar Papers
No similar papers found.