TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant on-chip memory pressure caused by the growing Key-Value (KV) cache with sequence length in embedded Transformer inference, which severely limits energy efficiency and performance. The authors propose a two-stage optimization approach: first, cycle-accurate simulation is employed to obtain time-resolved memory occupancy traces, which then guide the design of SRAM bank partitioning and power-gating configurations. This study pioneers the application of time-resolved memory analysis to co-optimize memory architecture and power consumption for embedded Transformers and reveals that Grouped-Query Attention (GQA) offers a substantial reduction in peak memory footprint compared to Multi-Head Attention (MHA). Experimental results demonstrate that, under identical accelerator configurations, DeepSeek-R1-Distill-Qwen-1.5B achieves a 2.72× lower peak on-chip memory usage than GPT-2 XL, markedly enhancing power-gating opportunities.
📝 Abstract
Transformer neural networks achieve state-of-the-art accuracy across language and vision tasks, but their deployment on embedded hardware is hindered by stringent area, latency, and energy constraints. During inference, performance and efficiency are increasingly dominated by the Key-Value (KV) cache, whose memory footprint grows with sequence length, straining on-chip memory utilization. Although existing mechanisms such as Grouped-Query Attention (GQA) reduce KV cache requirements compared to Multi-Head Attention (MHA), effectively exploiting this reduction requires understanding how on-chip memory demand evolves over time. This work presents TRAPTI, a two-stage methodology that combines cycle-level inference simulation with time-resolved analysis of on-chip memory occupancy to guide design decisions. In the first stage, the framework obtains memory occupancy traces and memory access statistics from simulation. In the second stage, the framework leverages the traces to explore banked memory organizations and power-gating configurations in an offline optimization flow. We apply this methodology to GPT-2 XL and DeepSeek-R1-Distill-Qwen-1.5B under the same accelerator configuration, enabling a direct comparison of MHA and GQA memory profiles. The analysis shows that DeepSeek-R1-Distill-Qwen-1.5B exhibits a 2.72× reduction in peak on-chip memory utilization in this setting compared to GPT-2 XL, unlocking further opportunities for power-gating optimization.
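The second stage described above (turning a time-resolved occupancy trace into power-gating opportunities for a banked SRAM) can be sketched as follows. This is a minimal illustration, not the paper's actual optimization flow: the trace values, bank count, and the assumption that live data is packed into the fewest possible banks are all hypothetical.

```python
import math

def gating_opportunity(trace_kib, total_sram_kib, num_banks):
    """For each sample of a time-resolved occupancy trace (KiB), count
    how many SRAM banks hold no live data and could be power-gated,
    assuming data is packed into the fewest banks (illustrative policy)."""
    bank_size = total_sram_kib / num_banks
    gated = []
    for occupancy in trace_kib:
        banks_needed = math.ceil(occupancy / bank_size)
        gated.append(num_banks - min(num_banks, banks_needed))
    return gated

# Toy trace: KV-cache growth with sequence length drives occupancy up,
# so gating opportunity shrinks as decoding proceeds.
trace = [64, 128, 256, 384, 512]  # KiB occupied per sample (hypothetical)
gated = gating_opportunity(trace, total_sram_kib=512, num_banks=8)
# → [7, 6, 4, 2, 0]
```

A lower peak occupancy, such as the GQA profile of DeepSeek-R1-Distill-Qwen-1.5B versus the MHA profile of GPT-2 XL, keeps more banks gateable for longer under this kind of analysis.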
Problem

Research questions and friction points this paper is trying to address.

Transformer inference
KV cache
on-chip memory
embedded systems
memory utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time-Resolved Analysis
SRAM Banking
Power Gating
KV Cache Optimization
Embedded Transformer Inference