LightRetriever: A LLM-based Hybrid Retrieval Architecture with 1000x Faster Query Inference

📅 2025-05-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address high query-encoding latency and substantial resource overhead in LLM-driven hybrid retrieval, this paper proposes a heterogeneous, decoupled architecture with a lightweight query encoder: documents retain full LLM-based encoding, while queries bypass real-time LLM inference entirely and are instead encoded through an embedding lookup that can be GPU-accelerated. It is presented as the first design to fully decouple query encoding from document encoding, which drastically reduces computational load. Compared to serving a full-sized LLM on an H800 GPU, it achieves over 1000× query inference speedup with GPU acceleration and a 20× speedup without a GPU. It retains an average of 95% of full-LLM retrieval performance on large-scale benchmarks. The core contribution is reformulating query encoding in hybrid retrieval as an efficient embedding lookup, which improves inference efficiency by one to three orders of magnitude at a cost of roughly 5% of retrieval performance.
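A minimal sketch of what "query encoding as an embedding lookup" can look like, assuming the query vector is the mean of per-token vectors read from a precomputed table. The summary does not spell out the pooling step, so the mean pooling, the whitespace tokenizer, the table `TOKEN_TABLE`, and the function `encode_query` are illustrative assumptions, not the paper's implementation.

```python
from typing import Dict, List

# Hypothetical precomputed table: token -> dense vector.
# In a real system these rows would be produced offline by the LLM;
# the numbers here are made up for illustration only.
TOKEN_TABLE: Dict[str, List[float]] = {
    "fast": [1.0, 0.0, 0.0],
    "query": [0.0, 1.0, 0.0],
    "encoder": [0.0, 0.0, 1.0],
}
DIM = 3


def encode_query(text: str) -> List[float]:
    """Encode a query with table lookups and mean pooling (no model forward pass)."""
    tokens = text.lower().split()  # assumed whitespace tokenizer
    rows = [TOKEN_TABLE[t] for t in tokens if t in TOKEN_TABLE]
    if not rows:
        return [0.0] * DIM  # no known tokens: fall back to a zero vector
    # Average each dimension across the looked-up rows.
    return [sum(col) / len(rows) for col in zip(*rows)]


vec = encode_query("fast query")
```

The point of the sketch is that online cost scales with the number of query tokens and the vector width, not with the size of the LLM that produced the table.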

📝 Abstract
Large Language Model (LLM)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving heavily parameterized LLMs reduces query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining on average 95% of the full-sized model's performance.
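The retrieval step the abstract describes can be sketched as follows, assuming similarity is a dot product and the hybrid score is a plain sum of the dense and sparse scores. The abstract only says that retrieval is based on vector similarities, so the fusion rule, the token-count sparse query, the function names `dense_score`, `sparse_score`, and `hybrid_rank`, and the toy vectors are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def dense_score(q: List[float], d: List[float]) -> float:
    """Dot product between two low-dimensional dense vectors."""
    return sum(a * b for a, b in zip(q, d))


def sparse_score(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Dot product between sparse vectors stored as term -> weight maps."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def hybrid_rank(
    q_dense: List[float],
    q_sparse: Dict[str, float],
    docs: Dict[str, Tuple[List[float], Dict[str, float]]],
) -> List[Tuple[str, float]]:
    """Rank documents by dense + sparse similarity (assumed fusion rule)."""
    scored = [
        (doc_id, dense_score(q_dense, dv) + sparse_score(q_sparse, sv))
        for doc_id, (dv, sv) in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Documents are encoded offline; these vectors are toy values.
docs = {
    "doc_a": ([0.9, 0.1], {"retrieval": 1.5, "llm": 0.5}),
    "doc_b": ([0.1, 0.9], {"cooking": 2.0}),
}
q_dense = [1.0, 0.0]
# Assumed sparse query: raw token counts, which need no model inference.
q_sparse = dict(Counter("llm retrieval".split()))
ranking = hybrid_rank(q_dense, q_sparse, docs)
```

Because document vectors are fixed ahead of time, only the query side of each dot product has to be computed online.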
Problem

Research questions and friction points this paper is trying to address.

Efficient online query encoding for hybrid retrieval
Reducing the query inference throughput loss caused by serving full-sized LLMs
Maintaining retrieval performance with lightweight encoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid retrieval with LLM-based dense/sparse vectors
Lightweight query encoder via embedding lookup
1000x faster query inference with GPU acceleration
Guangyuan Ma
Chinese Academy of Sciences
Information Retrieval
Yongliang Ma
Langboat Technology
LLM, RAG, Information Retrieval, Natural Language Processing, Document Understanding
Xuanrui Gou
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zhenpeng Su
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Ming Zhou
Langboat Technology, Beijing, China
Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China