MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

πŸ“… 2026-03-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the high inference costs of large language models (LLMs) in real-world deployment, particularly in scenarios with frequent repetitive or semantically similar queries that lead to significant computational redundancy. To mitigate this, the authors propose MemBoost, a novel framework that leverages a memory mechanism to reuse historical responses and dynamically retrieve relevant information. MemBoost employs a lightweight model to handle routine queries while selectively routing uncertain or complex requests to a more capable LLM. This approach uniquely integrates memory-augmented reasoning, cost-aware query routing, and a collaborative architecture between light and heavy models, extending beyond the single-pass limitations of conventional retrieval-augmented generation. Experimental results demonstrate that MemBoost substantially reduces both LLM invocation frequency and overall inference cost across diverse models and simulated workloads, while maintaining response quality on par with that of the strong model alone.
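The memory mechanism described above — reusing historical responses for repetitive or semantically similar queries — can be sketched as a similarity-thresholded answer cache. The class below is an illustrative toy, not the paper's implementation; the embedding representation, the `AnswerMemory` name, and the 0.9 threshold are all assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class AnswerMemory:
    """Toy semantic cache: stores (embedding, answer) pairs and returns
    a cached answer when a new query is close enough to a past one."""

    def __init__(self, threshold=0.9):  # threshold is an assumed value
        self.entries = []               # list of (embedding, answer)
        self.threshold = threshold

    def add(self, embedding, answer):
        # Continual memory growth: every served answer becomes reusable.
        self.entries.append((embedding, answer))

    def lookup(self, embedding):
        # Return the most similar stored answer, or None on a cache miss.
        best_sim, best_answer = 0.0, None
        for emb, ans in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_answer = sim, ans
        return best_answer if best_sim >= self.threshold else None
```

A near-duplicate query then hits the cache and costs no model call, while a dissimilar query falls through to inference.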
πŸ“ Abstract
Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
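The serving loop the abstract describes — reuse a past answer when possible, let the lightweight model handle routine queries, and escalate uncertain ones to the stronger model — might look roughly like the following. This is a minimal sketch, not the authors' code: `route_query`, the exact-match memory dict, the `confidence_fn` signature, and the 0.7 threshold are all hypothetical.

```python
def route_query(query, memory, small_model, large_model,
                confidence_fn, conf_threshold=0.7):
    """Cost-aware routing sketch: memory reuse first, cheap model next,
    strong model only for low-confidence answers. All names assumed."""
    if query in memory:                 # answer reuse: zero model cost
        return memory[query], "memory"
    answer = small_model(query)
    if confidence_fn(query, answer) >= conf_threshold:
        memory[query] = answer          # grow memory for future reuse
        return answer, "small"
    answer = large_model(query)         # escalate uncertain/complex queries
    memory[query] = answer
    return answer, "large"
```

Under this policy the expensive model is invoked only when the light model's confidence falls below the threshold, and every answer (from either model) is cached for later reuse.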
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
inference cost
query duplication
cost-aware serving
memory reuse
Innovation

Methods, ideas, or system contributions that make the work stand out.

cost-aware inference
memory-augmented LLM
answer reuse
dynamic routing
interactive serving
Joris KΓΆster
Department of Computer Science, Aalto University, Finland
Zixuan Liu
Department of Computer Science, Tulane University, LA, USA
Siavash Khajavi
Department of Industrial Engineering and Management, Aalto University, Finland
Zizhan Zheng
Tulane University
AI Security and Safety
Reinforcement Learning
Generative AI
Networks