🤖 AI Summary
To reduce the high inference latency of large language models (LLMs) in resource-constrained environments such as on-device and edge deployments, this paper proposes StorInfer, a storage-assisted inference system. An LLM-driven generator precomputes 150K unique query-response pairs offline (up to 830 MB of storage), which are embedded and indexed in a disk-backed vector database. Two techniques increase the coverage of the pre-stored pairs: adaptive query masking, which prevents regeneration of similar queries, and adaptive sampling, which dynamically tunes generation parameters to promote semantic diversity. At runtime, a user query that semantically matches a stored query is answered directly from storage, bypassing GPU inference. Experiments across multiple QA datasets show up to a 17.3% latency reduction with no loss in response quality. Key contributions include: (i) a compact vector database of precomputed pairs built via LLM-driven generation from a knowledge base; (ii) adaptive query masking and adaptive sampling to maximize coverage; and (iii) a similarity-based retrieval pipeline that returns stored responses for matching queries.
📝 Abstract
Large language model (LLM) inference often suffers from high latency, particularly in resource-constrained environments such as on-device or edge deployments. To address this challenge, we present StorInfer, a novel storage-assisted LLM inference system that accelerates response time by precomputing and storing predictable query-response pairs offline. When a user query semantically matches a precomputed query, StorInfer bypasses expensive GPU inference and instantly returns the stored response, significantly reducing latency and compute costs. To maximize coverage and effectiveness, StorInfer employs an LLM-driven generator that adaptively produces diverse and deduplicated queries based on a given knowledge base. This is achieved via two techniques: adaptive query masking, which prevents regeneration of similar queries, and adaptive sampling, which dynamically tunes generation parameters to promote semantic diversity. The resulting query-response pairs are embedded and indexed using a disk-backed vector database to enable fast, similarity-based retrieval at runtime. Using this approach, we generated 150K unique precomputed pairs (taking up to 830 MB of storage space), achieving up to 17.3% latency reduction with no loss in response quality. Our evaluation across multiple QA datasets demonstrates the practicality and scalability of storage-assisted inference, especially in scenarios with predictable query distributions. StorInfer highlights a promising direction in leveraging storage as a primary enabler for efficient, low-latency LLM deployment.
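The runtime path described above (return a stored response when the incoming query is close enough to a precomputed one, otherwise fall back to normal inference) can be illustrated with a minimal sketch. This is not StorInfer's implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for the disk-backed vector database, and the names `PrecomputedStore`, `answer`, `run_llm`, and the `0.8` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real embedding model)."""
    tokens = (tok.strip(".,?!").lower() for tok in text.split())
    return Counter(tok for tok in tokens if tok)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PrecomputedStore:
    """In-memory stand-in for the disk-backed vector database (hypothetical)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff, not from the paper
        self.entries = []           # list of (query embedding, stored response)

    def add(self, query, response):
        self.entries.append((embed(query), response))

    def lookup(self, query):
        """Return the best-matching stored response, or None below the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def answer(query, store, run_llm):
    """Serve from storage on a semantic match; otherwise run inference."""
    cached = store.lookup(query)
    if cached is not None:
        return cached, "store"
    return run_llm(query), "llm"


# Usage: one precomputed pair, one matching query, one non-matching query.
store = PrecomputedStore(threshold=0.8)
store.add("What is the capital of France?", "Paris")
hit = answer("what is the capital of france", store, lambda q: "generated")
miss = answer("How do I bake bread?", store, lambda q: "generated")
```

The threshold controls the trade-off the abstract implies: a higher value returns stored responses only for near-identical queries, while a lower value serves more queries from storage at the risk of returning a response to a different question. A linear scan is used here only for brevity; an indexed vector search would replace it at the scale of 150K pairs.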