MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

πŸ“… 2026-03-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the high inference costs of large language models (LLMs) in real-world deployment, particularly in scenarios with frequent repetitive or semantically similar queries that lead to significant computational redundancy. To mitigate this, the authors propose MemBoost, a novel framework that leverages a memory mechanism to reuse historical responses and dynamically retrieve relevant information. MemBoost employs a lightweight model to handle routine queries while selectively routing uncertain or complex requests to a more capable LLM. This approach uniquely integrates memory-augmented reasoning, cost-aware query routing, and a collaborative architecture between light and heavy models, extending beyond the single-pass limitations of conventional retrieval-augmented generation. Experimental results demonstrate that MemBoost substantially reduces both LLM invocation frequency and overall inference cost across diverse models and simulated workloads, while maintaining response quality on par with that of the strong model alone.
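The memory mechanism described above — reusing historical responses for repetitive or semantically similar queries — can be sketched as a similarity-thresholded answer cache. The class below is an illustrative toy, not the paper's implementation; the embedding representation, the `AnswerMemory` name, and the 0.9 threshold are all assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class AnswerMemory:
    """Toy semantic cache: stores (embedding, answer) pairs and returns
    a cached answer when a new query is close enough to a past one."""

    def __init__(self, threshold=0.9):  # threshold is an assumed value
        self.entries = []               # list of (embedding, answer)
        self.threshold = threshold

    def add(self, embedding, answer):
        # Continual memory growth: every served answer becomes reusable.
        self.entries.append((embedding, answer))

    def lookup(self, embedding):
        # Return the most similar stored answer, or None on a cache miss.
        best_sim, best_answer = 0.0, None
        for emb, ans in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_answer = sim, ans
        return best_answer if best_sim >= self.threshold else None
```

A near-duplicate query then hits the cache and costs no model call, while a dissimilar query falls through to inference.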
πŸ“ Abstract
Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
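The serving loop the abstract describes — reuse a past answer when possible, let the lightweight model handle routine queries, and escalate uncertain ones to the stronger model — might look roughly like the following. This is a minimal sketch, not the authors' code: `route_query`, the exact-match memory dict, the `confidence_fn` signature, and the 0.7 threshold are all hypothetical.

```python
def route_query(query, memory, small_model, large_model,
                confidence_fn, conf_threshold=0.7):
    """Cost-aware routing sketch: memory reuse first, cheap model next,
    strong model only for low-confidence answers. All names assumed."""
    if query in memory:                 # answer reuse: zero model cost
        return memory[query], "memory"
    answer = small_model(query)
    if confidence_fn(query, answer) >= conf_threshold:
        memory[query] = answer          # grow memory for future reuse
        return answer, "small"
    answer = large_model(query)         # escalate uncertain/complex queries
    memory[query] = answer
    return answer, "large"
```

Under this policy the expensive model is invoked only when the light model's confidence falls below the threshold, and every answer (from either model) is cached for later reuse.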
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
inference cost
query duplication
cost-aware serving
memory reuse
Innovation

Methods, ideas, or system contributions that make the work stand out.

cost-aware inference
memory-augmented LLM
answer reuse
dynamic routing
interactive serving
Joris KΓΆster
Department of Computer Science, Aalto University, Finland
Zixuan Liu
Department of Computer Science, Tulane University, LA, USA
Siavash Khajavi
Department of Industrial Engineering and Management, Aalto University, Finland
Zizhan Zheng
Tulane University
AI Security and Safety
Reinforcement Learning
Generative AI
Networks