Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

📅 2026-06-08

🤖 AI Summary

This work addresses the difficulty of running retrieval-augmented generation (RAG) entirely on-device: local execution benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. The authors present what is, to their knowledge, the first end-to-end RAG pipeline that runs all neural stages on the Hexagon NPU of the Snapdragon X Elite: embedding, reranking, and large language model (LLM) generation. Compared to a CPU baseline, the NPU achieves 9.1× higher embedding throughput and 12.3× lower system energy during indexing. In the query phase, on a 120-query Wikipedia-passage benchmark, LLM prefilling is 18.1× faster, and end-to-end latency and system energy are each 4.0× lower than on the CPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with the CPU and GPU backends within evaluator noise. The study points to a viable path toward efficient, low-power on-device RAG on this class of laptop SoC.

📝 Abstract

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.
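The query-phase ratios in the abstract are stated against two different baselines (NPU vs. CPU, GPU vs. CPU for latency, GPU vs. NPU for energy). As a rough aid to reading them, the sketch below normalizes everything to CPU = 1.0. It is not from the paper: the function name `relative_costs` is hypothetical, "1.7x slower" is assumed to mean 1.7 times the CPU latency, and because the inputs are rounded ratios, the derived values are approximate.

```python
# Ratios as reported in the abstract (already rounded to one decimal there).
NPU_LATENCY_GAIN_VS_CPU = 4.0  # NPU end-to-end query latency is 4.0x lower than CPU
NPU_ENERGY_GAIN_VS_CPU = 4.0   # NPU system energy is 4.0x lower than CPU
GPU_SLOWDOWN_VS_CPU = 1.7      # GPU is 1.7x slower than CPU (assumed: 1.7x the latency)
GPU_ENERGY_VS_NPU = 6.5        # GPU uses 6.5x more energy than the NPU


def relative_costs():
    """Return (latency, energy) dicts normalized so that the CPU baseline is 1.0."""
    latency = {
        "cpu": 1.0,
        "npu": 1.0 / NPU_LATENCY_GAIN_VS_CPU,
        "gpu": GPU_SLOWDOWN_VS_CPU,
    }
    energy = {
        "cpu": 1.0,
        "npu": 1.0 / NPU_ENERGY_GAIN_VS_CPU,
    }
    # GPU energy is only given relative to the NPU, so chain through the NPU value.
    energy["gpu"] = energy["npu"] * GPU_ENERGY_VS_NPU
    return latency, energy


if __name__ == "__main__":
    latency, energy = relative_costs()
    for backend in ("cpu", "gpu", "npu"):
        print(f"{backend}: latency x{latency[backend]:.3f}, energy x{energy[backend]:.3f}")
```

Under these assumptions, the GPU would take roughly 6.8 times the NPU's query latency and about 1.6 times the CPU's system energy. Both figures are derived from the rounded ratios above, not results reported by the authors.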

Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation

on-device inference

energy efficiency

mobile NPU

edge intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-device RAG

NPU acceleration

energy efficiency