A Memory-Efficient Retrieval Architecture for RAG-Enabled Wearable Medical LLMs-Agents

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address frequent memory accesses and high energy consumption during the retrieval phase of Retrieval-Augmented Generation (RAG) systems on edge medical devices, this work proposes a hierarchical retrieval architecture tailored for wearable medical large language model agents. The architecture employs a two-stage collaborative mechanism—approximate (coarse-grained) retrieval followed by high-precision (fine-grained) retrieval—integrating INT8-quantized embedding matching with an optimized indexing scheme. It achieves substantial resource savings while preserving clinical information retrieval accuracy. Implemented and validated in TSMC 28nm CMOS technology, the design reduces memory accesses by 49.8% and computational operations by 75.2% compared to a baseline pure-INT8 approach, with an energy cost of only 177.76 μJ per MB of data per query. This is the first systematic application of the hierarchical approximate-precise retrieval paradigm to low-power medical RAG systems, delivering a high-energy-efficiency, high-accuracy retrieval solution for resource-constrained edge environments.
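The summary above names INT8-quantized embedding matching as the coarse retrieval stage. The paper's exact quantization and matching scheme is not reproduced here; as a minimal sketch, symmetric per-vector INT8 quantization with an integer dot product looks roughly like this (all function names and parameters are illustrative assumptions):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-vector INT8 quantization: map floats into [-127, 127].

    Assumed scheme for illustration; the paper's actual scheme may differ.
    """
    scale = float(np.abs(x).max()) / 127.0 or 1.0  # 1.0 guards the all-zero vector
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_similarity(q_a, s_a, q_b, s_b):
    """Approximate inner product: INT8 multiply-accumulate, one float rescale."""
    acc = int(np.dot(q_a.astype(np.int32), q_b.astype(np.int32)))
    return acc * s_a * s_b
```

Accumulating in INT32 mirrors what an integer MAC array does in hardware; only the final rescale touches floating point, which is why a coarse INT8 stage is cheap in both memory traffic (4x smaller embeddings than FP32) and energy.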

📝 Abstract
With powerful and integrative large language models (LLMs), medical AI agents have demonstrated unique advantages in providing personalized medical consultations, continuous health monitoring, and precise treatment plans. Retrieval-Augmented Generation (RAG) integrates personal medical documents into LLMs through an external retrievable database, avoiding the costly retraining or fine-tuning required to deploy customized agents. While deploying medical agents on edge devices ensures privacy protection, RAG implementations impose substantial memory access and energy consumption during the retrieval stage. This paper presents a hierarchical retrieval architecture for edge RAG that leverages a two-stage retrieval scheme: approximate retrieval generates a candidate set, followed by high-precision retrieval over the pre-selected document embeddings. The proposed architecture significantly reduces energy consumption and external memory access while maintaining retrieval accuracy. Simulation results under TSMC 28nm technology show that the proposed hierarchical retrieval architecture reduces overall memory access by nearly 50% and computation by 75% compared to pure INT8 retrieval, with a total energy consumption of 177.76 μJ/query for 1 MB of data retrieval.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory access in RAG retrieval for edge devices
Minimizing energy consumption during medical document retrieval
Maintaining retrieval accuracy while optimizing computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical retrieval architecture for edge RAG
Two-stage retrieval combining approximate and precise methods
Reduces memory access and energy consumption significantly
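The two-stage scheme listed above can be sketched end to end: a cheap INT8 pass screens all documents, and full-precision scoring runs only on the surviving candidates. This is a hedged illustration under assumed details, not the paper's implementation; the optimized indexing scheme is omitted, and the candidate-set size and top-k are made-up parameters:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-vector INT8 quantization (illustrative, not the paper's exact scheme)."""
    scale = float(np.abs(x).max()) / 127.0 or 1.0  # avoid zero scale
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def hierarchical_retrieve(query, docs, n_candidates=16, top_k=4):
    """Stage 1: coarse INT8 screening over all docs. Stage 2: FP32 re-ranking of survivors."""
    q_int8, _ = quantize_int8(query)
    # In a real system the INT8 document embeddings would be precomputed and
    # stored, so stage 1 reads 4x less data than an FP32 scan would.
    quantized = [quantize_int8(d) for d in docs]
    d_int8 = np.stack([q for q, _ in quantized])
    d_scales = np.array([s for _, s in quantized])
    # Stage 1: integer dot products, rescaled per document for a fair ranking
    coarse = (d_int8.astype(np.int32) @ q_int8.astype(np.int32)) * d_scales
    cand = np.argsort(coarse)[::-1][:n_candidates]
    # Stage 2: exact full-precision similarity, but only on the small candidate set
    fine = docs[cand] @ query
    return cand[np.argsort(fine)[::-1][:top_k]]
```

The savings come from the asymmetry: the expensive full-precision work scales with `n_candidates` rather than with the corpus size, while the per-document cost of stage 1 is a single INT8 dot product.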
👥 Authors
Zhipeng Liao
Professor, Department of Economics, UCLA (economics, econometrics)
Kunming Shao
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Jiangnan Yu
The Hong Kong University of Science and Technology (LLM acceleration, AI accelerator, Network on Chip, MoE, Emerging Non-Volatile Memory)
Liang Zhao
South China University of Technology, Guangzhou, China
Tim Kwang-Ting Cheng
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Chi-Ying Tsui
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Jie Yang
CenBRAIN, Westlake University, Hangzhou, China; Integrated-On-Chips Brain-Computer Interfaces Zhejiang Engineering Research Center, Hangzhou, China
Mohamad Sawan
CenBRAIN, Westlake University, Hangzhou, China; Integrated-On-Chips Brain-Computer Interfaces Zhejiang Engineering Research Center, Hangzhou, China