🤖 AI Summary
To address the energy-efficiency bottleneck in Transformer inference, characterized by excessive off-chip memory accesses and low hardware utilization, this work proposes three techniques: a synergistic training and post-training compression framework, a dynamic batch-size control flow, and a bidirectionally accessible register-file architecture. The custom register file supports bidirectional read/write operations to enable efficient reuse of KV caches and activations; the dynamic batching mechanism adapts the batch size to varying sequence lengths and workload intensities; and the compression framework combines mixed-precision quantization with structured sparsification. Fabricated in a 16-nm FinFET process, the accelerator reduces off-chip memory traffic by 58% and raises hardware utilization to 92% across multiple Transformer models. Per-token inference latency ranges from 68 to 567 μs, with energy efficiency of 0.41–3.95 μJ/token.
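The dynamic batching idea above can be illustrated with a minimal sketch: group requests so that the padded token footprint of each batch (batch size × longest sequence in the batch) stays within a fixed on-chip buffer budget, so short sequences get large batches and long sequences get small ones. All names and the token budget below are illustrative assumptions, not the paper's actual control flow.

```python
def dynamic_batches(seq_lens, token_budget=1024):
    """Greedily pack requests into batches. Each batch is padded to its
    longest sequence, so len(batch) * max(batch) must fit token_budget.
    Sorting longest-first keeps similar lengths together, which reduces
    padding waste. Purely an illustrative sketch of the concept."""
    batches, current = [], []
    for n in sorted(seq_lens, reverse=True):
        max_len = max([n] + current)  # padded length if we add this request
        if (len(current) + 1) * max_len <= token_budget:
            current.append(n)
        else:
            batches.append(current)
            current = [n]
    if current:
        batches.append(current)
    return batches

# Long sequences end up in small batches, short ones in large batches:
print(dynamic_batches([512, 512, 64, 64, 64]))  # → [[512, 512], [64, 64, 64]]
```

The effect is that hardware utilization stays high across mixed workloads: the compute array sees a near-constant number of tokens per step instead of a fixed batch size that underfills (long sequences) or overflows (short sequences) the on-chip buffers.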
📝 Abstract
This work introduces novel training and post-training compression schemes that reduce external memory access during Transformer model inference. In addition, a new control-flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
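To make the compression claim concrete, here is a minimal post-training sketch combining the two ingredients named in the summary: structured sparsification (keep the 2 largest-magnitude weights in each group of 4, a common N:M pattern assumed here for illustration) followed by symmetric integer quantization. The function name, group sizes, and bit width are assumptions; the paper's exact schemes are not specified in this summary.

```python
def compress(weights, group=4, keep=2, levels=127):
    """Illustrative post-training compression sketch (not the paper's
    exact method): 2-out-of-4 structured sparsity, then symmetric
    per-tensor quantization to signed integers in [-levels, levels]."""
    pruned = []
    for i in range(0, len(weights), group):
        block = weights[i:i + group]
        # indices of the `keep` largest-magnitude weights in this group
        kept = sorted(range(len(block)), key=lambda j: -abs(block[j]))[:keep]
        pruned.extend(w if j in kept else 0.0 for j, w in enumerate(block))
    scale = max(abs(w) for w in pruned) / levels or 1.0
    quantized = [round(w / scale) for w in pruned]
    return quantized, scale
```

Both steps shrink external memory traffic: sparsity removes half the weights outright, and quantization stores the survivors in fewer bits, which is exactly the lever the abstract attributes to the compression schemes.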