T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET

📅 2025-03-01
🤖 AI Summary
Transformer inference is limited in energy efficiency by excessive off-chip memory access and low hardware utilization. To address both, this work combines training and post-training compression schemes, a dynamic batching control flow, and a two-direction accessible register file. The custom register file supports reads and writes in both directions so that KV caches and activations can be reused efficiently; dynamic batching adapts to varying sequence lengths and workload intensities; and model compression integrates mixed-precision quantization with structured sparsification. Fabricated in 16-nm FinFET technology, the accelerator reduces off-chip memory traffic by 58% and raises hardware utilization to 92% across multiple Transformer models. Per-token inference latency ranges from 68 to 567 μs, and energy consumption from 0.41 to 3.95 μJ/token.

📝 Abstract
This work introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference. Additionally, a new control flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
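
As a rough illustration of why batching can reduce external memory access, the sketch below assumes that a layer's weights are fetched from off-chip memory once per batch and shared by every sequence in that batch, and that the batch size is chosen from the input length so that short inputs are grouped together. The function names, the buffer capacity, and the selection policy are hypothetical and are not taken from the paper.

```python
def weight_bytes_per_token(weight_bytes: int, batch_size: int) -> float:
    """Off-chip weight traffic attributed to each token position when one
    weight fetch is shared by every sequence in the batch (assumed model)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return weight_bytes / batch_size


def pick_batch_size(seq_len: int, buffer_tokens: int, max_batch: int) -> int:
    """Hypothetical policy: batch as many sequences as the on-chip buffer
    can hold, so shorter inputs get larger batches."""
    if seq_len < 1:
        raise ValueError("seq_len must be >= 1")
    return max(1, min(max_batch, buffer_tokens // seq_len))


if __name__ == "__main__":
    # Illustrative numbers only: a 128-token buffer and at most 8 sequences.
    for seq_len in (128, 64, 32, 16):
        batch = pick_batch_size(seq_len, buffer_tokens=128, max_batch=8)
        traffic = weight_bytes_per_token(1_000_000, batch)
        print(f"seq_len={seq_len:4d} batch={batch} weight bytes/token={traffic:.0f}")
```

Under these assumptions, halving the input length doubles the batch size until the cap is reached, and the weight traffic per token falls in proportion.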
Problem

Research questions and friction points this paper is trying to address.

Reduces external memory access in transformer model inference.
Introduces dynamic batching to improve hardware utilization.
Proposes a two-direction accessible register file architecture.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel training and post-training compression schemes
Dynamic batching control flow mechanism
Two-direction accessible register file architecture
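
One possible reading of a two-direction accessible register file is a 2D buffer that can be read or written along either rows or columns, so that data stored in one orientation can be consumed in the other without a separate transpose buffer. The software model below is a minimal sketch under that assumption; the class and method names are hypothetical, and it does not describe the paper's circuit.

```python
class TwoDirectionRegisterFile:
    """Toy model of a 2D register array addressable by row or by column."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows = rows
        self.cols = cols
        self._cells = [[0] * cols for _ in range(rows)]

    def write_row(self, r: int, values: list) -> None:
        if len(values) != self.cols:
            raise ValueError("row write must supply one value per column")
        self._cells[r] = list(values)

    def write_col(self, c: int, values: list) -> None:
        if len(values) != self.rows:
            raise ValueError("column write must supply one value per row")
        for r, v in enumerate(values):
            self._cells[r][c] = v

    def read_row(self, r: int) -> list:
        return list(self._cells[r])

    def read_col(self, c: int) -> list:
        return [row[c] for row in self._cells]


if __name__ == "__main__":
    rf = TwoDirectionRegisterFile(rows=2, cols=3)
    rf.write_row(0, [1, 2, 3])
    rf.write_row(1, [4, 5, 6])
    # Reading by column yields the transposed view with no extra copy step.
    print([rf.read_col(c) for c in range(rf.cols)])
```

In this model, a matrix written row by row can be read column by column, which is the access pattern needed when one operand of a matrix multiplication must be consumed transposed.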