DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation

📅 2025-10-29
🤖 AI Summary
Edge-device RAG applications suffer from excessive storage overhead, latency, and energy consumption. This work proposes DIRC, a digital in-memory computing (IMC) architecture tailored for edge RAG acceleration, integrating multi-level ReRAM with SRAM to achieve high-density non-volatile storage (5.18 Mb/mm²) and single-cycle data loading. To enhance computational robustness and energy efficiency, DIRC introduces four key innovations: bit-level error optimization, differential sensing readout, query-stationary dataflow (QS-dataflow), and bit-level data remapping. In simulation under a TSMC 40 nm process, the system delivers 131 TOPS of throughput, achieves a retrieval latency of 5.6 μs per query, and consumes 0.956 μJ per query while maintaining retrieval accuracy. To the best of our knowledge, this is the first demonstration of a high-accuracy digital IMC system for end-to-end edge RAG acceleration, addressing critical memory bandwidth and energy-efficiency bottlenecks.
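
The latency and energy figures quoted above can be combined into two derived numbers. The following is a back-of-envelope sketch, not a result reported by the paper: it assumes the energy is spent evenly over the latency window and that queries run strictly back to back. The function names are hypothetical.

```python
# Back-of-envelope figures derived from the numbers quoted in the summary.
# Assumption: energy is spent evenly across the per-query latency window.
LATENCY_S = 5.6e-6    # reported retrieval latency per query (seconds)
ENERGY_J = 0.956e-6   # reported energy per query (joules)


def average_power_watts(energy_j, latency_s):
    """Average power while a query is being served."""
    return energy_j / latency_s


def queries_per_second(latency_s):
    """Upper bound on query rate if queries are issued back to back."""
    return 1.0 / latency_s


power_w = average_power_watts(ENERGY_J, LATENCY_S)
qps = queries_per_second(LATENCY_S)
print(f"average power ~ {power_w * 1e3:.1f} mW")
print(f"back-to-back rate ~ {qps:,.0f} queries/s")
```

Under these assumptions the average power comes out at roughly 171 mW and the back-to-back rate at roughly 179 thousand queries per second.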

📝 Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieval, but it faces challenges on edge devices due to high storage, energy, and latency demands. Computing-in-Memory (CIM) offers a promising solution by storing document embeddings in CIM macros and enabling in-situ parallel retrieval, but it is constrained by either low memory density or limited computational accuracy. To address these challenges, we present DIRC-RAG, a novel edge RAG acceleration architecture leveraging Digital In-ReRAM Computation (DIRC). DIRC integrates a high-density multi-level ReRAM subarray with an SRAM cell, utilizing SRAM and differential sensing for robust ReRAM readout and digital multiply-accumulate (MAC) operations. By storing all document embeddings within the CIM macro, DIRC achieves ultra-low-power, single-cycle data loading, substantially reducing both energy consumption and latency compared to off-chip DRAM. A query-stationary (QS) dataflow is supported for RAG tasks, minimizing on-chip data movement and reducing SRAM buffer requirements. We introduce error optimization for the DIRC ReRAM-SRAM cell by extracting the bit-wise spatial error distribution of the ReRAM subarray and applying targeted bit-wise data remapping. An error detection circuit is also implemented to enhance readout resilience against device- and circuit-level variations. Simulation results demonstrate that DIRC-RAG in a TSMC 40 nm process achieves an on-chip non-volatile memory density of 5.18 Mb/mm² and a throughput of 131 TOPS. It delivers a 4 MB retrieval latency of 5.6 μs/query and an energy consumption of 0.956 μJ/query while maintaining retrieval precision.
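
The query-stationary idea in the abstract can be illustrated in software. This is a minimal sketch under stated assumptions, not the paper's hardware: the 8-bit quantization, the inner-product similarity, the top-k selection, and the names `quantize` and `qs_retrieve` are all hypothetical choices made for illustration. The point it shows is that the query is loaded once and held fixed while every stored embedding is swept past it, so only scores leave the memory.

```python
import heapq


def quantize(vec, bits=8):
    """Map floats in [-1, 1] to signed integers (assumed INT8-style format)."""
    scale = (1 << (bits - 1)) - 1
    return [max(-scale, min(scale, round(x * scale))) for x in vec]


def qs_retrieve(query, stored_embeddings, top_k=2):
    """Hold the query fixed and sweep it across every stored embedding.

    Each inner loop stands in for one in-memory MAC over a stored row.
    The embeddings are never copied out; only their scores are collected.
    """
    q = quantize(query)  # loaded once, stays stationary for the whole sweep
    scores = []
    for doc_id, row in enumerate(stored_embeddings):
        acc = 0
        for qi, di in zip(q, row):  # integer multiply-accumulate
            acc += qi * di
        scores.append((acc, doc_id))
    # Return the ids of the top_k highest-scoring documents, best first.
    return [doc_id for _, doc_id in heapq.nlargest(top_k, scores)]


# Three toy document embeddings, quantized once at "write" time.
docs = [quantize(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0])]
print(qs_retrieve([0.9, 0.1, 0.0], docs))  # prints [0, 2]
```

In this toy run the query is closest to document 0, then document 2, so those two ids are returned in that order.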

Problem

Research questions and friction points this paper is trying to address.

Accelerating RAG on edge devices with high-density memory
Enhancing computational accuracy in memory for retrieval operations
Reducing energy and latency in edge AI knowledge retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital In-ReRAM Computation with SRAM integration
Query-stationary dataflow minimizing on-chip movement
Error optimization via bit-wise spatial remapping
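
One way to picture bit-wise remapping is to place the most significant bits of each stored word in the physical positions with the lowest measured error rate. The sketch below illustrates that general idea only; the listing does not describe the paper's actual remapping policy, and the error rates, the independent bit-flip model, and every function name here are hypothetical.

```python
def build_remap(position_error_rates):
    """Return remap[bit_index] -> physical position (bit index 0 is the LSB).

    The LSB is assigned the least reliable position and the MSB the most
    reliable one, so that likely errors land on low-weight bits.
    """
    n = len(position_error_rates)
    # Physical positions ordered from worst to best error rate.
    return sorted(range(n), key=lambda p: position_error_rates[p], reverse=True)


def expected_abs_error(remap, position_error_rates):
    """Expected error magnitude for an unsigned word, assuming independent flips."""
    return sum((1 << bit) * position_error_rates[pos] for bit, pos in enumerate(remap))


def store(value, remap):
    """Scatter the bits of value into their assigned physical positions."""
    cells = [0] * len(remap)
    for bit, pos in enumerate(remap):
        cells[pos] = (value >> bit) & 1
    return cells


def load(cells, remap):
    """Gather the bits back into an integer using the same remap."""
    return sum(cells[pos] << bit for bit, pos in enumerate(remap))


# Hypothetical per-position error rates for a 4-bit word.
rates = [0.001, 0.02, 0.005, 0.0005]
identity = list(range(len(rates)))
remap = build_remap(rates)
print(remap)
print(expected_abs_error(identity, rates), expected_abs_error(remap, rates))
```

With these made-up rates, the remapped layout has a lower expected error magnitude than the identity layout, and `load(store(v, remap), remap)` returns `v` for every 4-bit value.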