🤖 AI Summary
To address the challenges of deploying retrieval-augmented generation (RAG) on memory- and energy-constrained mobile devices, this paper proposes what it describes as the first fully on-device, efficient, low-overhead RAG framework. The framework combines EcoVector, a lightweight vector retrieval algorithm that loads partitioned index data on demand, with Selective Content Reduction (SCR), a technique that trims retrieved passages to fit the input limits of compact language models and thereby reduces computational overhead. The key contributions are: (i) the first end-to-end, offline-capable on-device RAG system; and (ii) significant efficiency gains (42% lower latency, 58% smaller memory footprint, and 37% less energy consumption) versus baseline approaches, without compromising generation accuracy. Because it runs entirely on the device, the framework preserves privacy while remaining responsive and resource-efficient, offering a practical approach to edge intelligence.
📝 Abstract
Retrieval-Augmented Generation (RAG) has proven effective on server infrastructures, but its application on mobile devices remains underexplored because of limited memory and power resources. Existing vector search and RAG solutions largely assume abundant computational resources, making them impractical for on-device scenarios. In this paper, we propose MobileRAG, a fully on-device pipeline that overcomes these limitations by combining a mobile-friendly vector search algorithm, EcoVector, with a lightweight Selective Content Reduction (SCR) method. By partitioning the index and loading only part of it at a time, EcoVector drastically reduces both memory footprint and CPU usage, while the SCR method filters out irrelevant text to shrink the Language Model (LM) input without degrading accuracy. Extensive experiments demonstrate that MobileRAG significantly outperforms conventional vector search and RAG methods in latency, memory usage, and power consumption, while maintaining accuracy and enabling offline operation to safeguard privacy in resource-constrained environments.
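
The two ideas named in the abstract, partial loading of a partitioned index and filtering of retrieved text, can be illustrated with a small sketch. This is not the paper's EcoVector or SCR implementation, whose details the abstract does not give. It is a toy stand-in under stated assumptions: partitions are JSON files on disk, only partition centroids stay in memory, and "irrelevant" text is approximated by sentences that share no term with the query. All names (`PartitionedIndex`, `reduce_passage`, `n_probe`) are hypothetical.

```python
import json
import math
import os
import tempfile


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class PartitionedIndex:
    """Toy index: centroids stay in RAM, partition vectors live on disk."""

    def __init__(self, directory):
        self.directory = directory
        self.centroids = []  # list of (partition_id, centroid_vector)

    def _path(self, pid):
        return os.path.join(self.directory, f"part_{pid}.json")

    def add_partition(self, pid, items):
        """Store items [(doc_id, vector), ...] on disk; keep only the centroid."""
        dim = len(items[0][1])
        centroid = [sum(v[i] for _, v in items) / len(items) for i in range(dim)]
        with open(self._path(pid), "w") as f:
            json.dump(items, f)
        self.centroids.append((pid, centroid))

    def search(self, query, k=2, n_probe=1):
        """Load only the n_probe partitions whose centroids best match the query."""
        ranked = sorted(self.centroids,
                        key=lambda c: cosine(query, c[1]), reverse=True)
        hits = []
        for pid, _ in ranked[:n_probe]:
            with open(self._path(pid)) as f:  # on-demand load of one partition
                for doc_id, vec in json.load(f):
                    hits.append((cosine(query, vec), doc_id))
        hits.sort(reverse=True)
        return [doc_id for _, doc_id in hits[:k]]


def reduce_passage(query, passage, min_overlap=1):
    """Keep only sentences sharing at least min_overlap terms with the query.

    Assumption: plain term overlap stands in for the paper's relevance filter.
    """
    q_terms = set(query.lower().split())
    kept = [s.strip() for s in passage.split(".")
            if s.strip() and len(q_terms & set(s.lower().split())) >= min_overlap]
    return ". ".join(kept) + ("." if kept else "")


def demo():
    """Build a two-partition index in a temp directory and run one query."""
    with tempfile.TemporaryDirectory() as d:
        index = PartitionedIndex(d)
        index.add_partition(0, [("battery", [1.0, 0.0]), ("charger", [0.9, 0.1])])
        index.add_partition(1, [("camera", [0.0, 1.0]), ("lens", [0.1, 0.9])])
        return index.search([1.0, 0.05], k=1, n_probe=1)
```

In this sketch, `n_probe` controls how many partitions are read from disk per query, trading recall for memory and I/O, and `reduce_passage` would be applied to each retrieved passage before it is placed in the language model prompt.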