🤖 AI Summary
To address the energy-efficiency bottleneck in Transformer inference, characterized by excessive off-chip memory accesses and low hardware utilization, this work proposes three techniques: a synergistic training and post-training compression framework, a dynamic batch-size control flow, and a bidirectionally accessible register-file architecture. The custom register file supports bidirectional read/write operations to enable efficient reuse of KV caches and activations; the dynamic batching mechanism adapts the batch size to varying sequence lengths and workload intensities; and the compression framework combines mixed-precision quantization with structured sparsification. Fabricated in a 16-nm FinFET process, the accelerator reduces off-chip memory traffic by 58% and raises hardware utilization to 92% across multiple Transformer models. Per-token inference latency ranges from 68 to 567 μs, with energy efficiency of 0.41–3.95 μJ/token.
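The dynamic batching idea above can be illustrated with a minimal sketch: group requests so that the padded token footprint of each batch (batch size × longest sequence in the batch) stays within a fixed on-chip buffer budget, so short sequences get large batches and long sequences get small ones. All names and the token budget below are illustrative assumptions, not the paper's actual control flow.

```python
def dynamic_batches(seq_lens, token_budget=1024):
    """Greedily pack requests into batches. Each batch is padded to its
    longest sequence, so len(batch) * max(batch) must fit token_budget.
    Sorting longest-first keeps similar lengths together, which reduces
    padding waste. Purely an illustrative sketch of the concept."""
    batches, current = [], []
    for n in sorted(seq_lens, reverse=True):
        max_len = max([n] + current)  # padded length if we add this request
        if (len(current) + 1) * max_len <= token_budget:
            current.append(n)
        else:
            batches.append(current)
            current = [n]
    if current:
        batches.append(current)
    return batches

# Long sequences end up in small batches, short ones in large batches:
print(dynamic_batches([512, 512, 64, 64, 64]))  # → [[512, 512], [64, 64, 64]]
```

The effect is that hardware utilization stays high across mixed workloads: the compute array sees a near-constant number of tokens per step instead of a fixed batch size that underfills (long sequences) or overflows (short sequences) the on-chip buffers.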
📝 Abstract
This work introduces novel training and post-training compression schemes that reduce external memory access during Transformer model inference. In addition, a new control-flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
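To make the compression claim concrete, here is a minimal post-training sketch combining the two ingredients named in the summary: structured sparsification (keep the 2 largest-magnitude weights in each group of 4, a common N:M pattern assumed here for illustration) followed by symmetric integer quantization. The function name, group sizes, and bit width are assumptions; the paper's exact schemes are not specified in this summary.

```python
def compress(weights, group=4, keep=2, levels=127):
    """Illustrative post-training compression sketch (not the paper's
    exact method): 2-out-of-4 structured sparsity, then symmetric
    per-tensor quantization to signed integers in [-levels, levels]."""
    pruned = []
    for i in range(0, len(weights), group):
        block = weights[i:i + group]
        # indices of the `keep` largest-magnitude weights in this group
        kept = sorted(range(len(block)), key=lambda j: -abs(block[j]))[:keep]
        pruned.extend(w if j in kept else 0.0 for j, w in enumerate(block))
    scale = max(abs(w) for w in pruned) / levels or 1.0
    quantized = [round(w / scale) for w in pruned]
    return quantized, scale
```

Both steps shrink external memory traffic: sparsity removes half the weights outright, and quantization stores the survivors in fewer bits, which is exactly the lever the abstract attributes to the compression schemes.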