🤖 AI Summary
Existing AI accelerator simulators (e.g., SCALE-Sim v2) lack integrated modeling of sparsity, multicore scalability, and fine-grained memory behavior. This work introduces SCALE-Sim v3, a modular, cycle-accurate, full-stack simulation platform tailored for modern AI accelerators. Our approach integrates five key innovations: (1) spatio-temporal multicore partitioning; (2) layer- and row-level sparse matrix multiplication; (3) Ramulator-driven DRAM modeling; (4) automatic data-layout optimization; and (5) Accelergy-enabled energy and power estimation. Built on a Python/C++ hybrid architecture, the platform supports both weight-stationary (WS) and output-stationary (OS) dataflows and multiple sparsity patterns. Experimental evaluation on ViT-base shows that a 64×64 systolic array achieves the best energy-delay product (EDP); moreover, OS reduces total execution cycles, including DRAM stalls, by 30.1% versus WS, substantially correcting the bias inherent in conventional compute-only cycle estimates.
📝 Abstract
The rapid advancements in AI, scientific computing, and high-performance computing (HPC) have driven the need for versatile and efficient hardware accelerators. Existing tools like SCALE-Sim v2 provide valuable cycle-accurate simulations for systolic-array-based architectures but fall short in supporting key modern features such as sparsity, multi-core scalability, and comprehensive memory analysis. To address these limitations, we present SCALE-Sim v3, a modular, cycle-accurate simulator that extends the capabilities of its predecessor. SCALE-Sim v3 introduces five significant enhancements: multi-core simulation with spatio-temporal partitioning and hierarchical memory structures, support for sparse matrix multiplication (SpMM) with layer-wise and row-wise sparsity, integration with Ramulator for detailed DRAM analysis, precise data-layout modeling to minimize memory stalls, and energy and power estimation via Accelergy. These improvements enable deeper end-to-end system analysis for modern AI accelerators, accommodating a wide variety of systems and workloads and providing detailed full-system insights into latency, bandwidth, and power efficiency. With latency as the sole metric, a 128×128 array is 6.53× faster than a 32×32 array on ViT-base. However, SCALE-Sim v3 finds that the 32×32 array is 2.86× more energy-efficient due to better utilization and lower leakage energy, and that for energy-delay product (EDP) the 64×64 array outperforms both. SCALE-Sim v2 reports a 21% reduction in compute cycles for six ResNet18 layers with the weight-stationary (WS) dataflow compared to output-stationary (OS); however, once DRAM stalls are factored in, the OS dataflow exhibits 30.1% fewer execution cycles than WS, highlighting the critical role of detailed DRAM analysis.
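The latency/energy/EDP trade-off in the abstract can be sketched numerically. In the snippet below, the relative delay and energy for the 32×32 and 128×128 arrays are back-derived from the quoted ratios (128×128 is 6.53× faster; 32×32 is 2.86× more energy-efficient); the 64×64 values are hypothetical placeholders chosen only to illustrate how a mid-sized array can minimize EDP, and none of these numbers are actual SCALE-Sim v3 outputs.

```python
# Illustrative energy-delay product (EDP) comparison across systolic-array sizes.
# Delay and energy are in arbitrary relative units; 32x32 and 128x128 follow the
# abstract's ratios, while the 64x64 entry is a hypothetical middle point.
configs = {
    "32x32":   {"delay": 6.53, "energy": 1.00},  # slow but energy-frugal
    "64x64":   {"delay": 2.20, "energy": 1.20},  # hypothetical, not measured
    "128x128": {"delay": 1.00, "energy": 2.86},  # fast but energy-hungry
}

def edp(cfg):
    """EDP = energy * delay; lower is better."""
    return cfg["energy"] * cfg["delay"]

for name, cfg in configs.items():
    print(f"{name:>7}: EDP = {edp(cfg):.2f}")

best = min(configs, key=lambda name: edp(configs[name]))
print("best EDP:", best)
```

The point of the metric is visible here: the fastest array pays for its speed in energy, the most frugal array pays in delay, and the product picks a middle configuration, matching the qualitative conclusion that 64×64 wins on EDP for ViT-base.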