SCALE-Sim v3: A modular cycle-accurate systolic accelerator simulator for end-to-end system analysis

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI accelerator simulators (e.g., SCALE-Sim v2) lack integrated modeling of sparsity, multicore scalability, and fine-grained memory behavior. This work introduces SCALE-Sim v3, a modular, cycle-accurate, full-stack simulation platform tailored for modern AI accelerators. The simulator integrates five key enhancements: (1) spatio-temporal multicore partitioning; (2) layer-wise and row-wise sparse matrix multiplication; (3) Ramulator-driven DRAM modeling; (4) automatic data layout optimization; and (5) Accelergy-based energy and power estimation. Built on a Python/C++ hybrid architecture, the platform supports both weight-stationary (WS) and output-stationary (OS) dataflows and multiple sparsity patterns. Experimental evaluation on ViT-base shows that a 64×64 systolic array achieves the best energy-delay product (EDP); moreover, once DRAM stalls are included, OS reduces total execution cycles by 30.1% versus WS, correcting the bias inherent in conventional compute-only cycle estimates.
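To make the compute-only bias concrete, here is a toy output-stationary cycle estimate for a GEMM on a systolic array. The function name, tiling scheme, and fill/drain term are illustrative assumptions, not SCALE-Sim's actual timing equations, and the model deliberately ignores the SRAM/DRAM stalls that SCALE-Sim v3's Ramulator integration accounts for:

```python
import math

def os_gemm_cycles(M, K, N, rows, cols):
    """Toy output-stationary cycle estimate for an (M x K) @ (K x N) GEMM
    on a rows x cols systolic array. Each output tile is pinned to a PE:
    one tile needs K accumulation cycles plus (rows + cols - 2) cycles of
    skewed operand fill/drain, and tiles ("folds") run back to back.
    Simplified sketch only -- not SCALE-Sim's exact model, and it omits
    the memory stalls whose inclusion flips the WS-vs-OS comparison."""
    folds = math.ceil(M / rows) * math.ceil(N / cols)
    return folds * (K + rows + cols - 2)

# Example: a 64x64x64 GEMM on a 32x32 array takes 4 folds of 126 cycles.
print(os_gemm_cycles(64, 64, 64, 32, 32))  # 504
```

Because such estimates count only array-side activity, two dataflows can swap rank once memory stalls are added, which is exactly the 21%-vs-30.1% reversal the paper reports.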

📝 Abstract
The rapid advancements in AI, scientific computing, and high-performance computing (HPC) have driven the need for versatile and efficient hardware accelerators. Existing tools like SCALE-Sim v2 provide valuable cycle-accurate simulations for systolic-array-based architectures but fall short in supporting key modern features such as sparsity, multi-core scalability, and comprehensive memory analysis. To address these limitations, we present SCALE-Sim v3, a modular, cycle-accurate simulator that extends the capabilities of its predecessor. SCALE-Sim v3 introduces five significant enhancements: multi-core simulation with spatio-temporal partitioning and hierarchical memory structures, support for sparse matrix multiplications (SpMM) with layer-wise and row-wise sparsity, integration with Ramulator for detailed DRAM analysis, precise data layout modeling to minimize memory stalls, and energy and power estimation via Accelergy. These improvements enable deeper end-to-end system analysis for modern AI accelerators, accommodating a wide variety of systems and workloads and providing detailed full-system insights into latency, bandwidth, and power efficiency. A 128x128 array is 6.53x faster than a 32x32 array for ViT-base, using only latency as a metric. However, SCALE-Sim v3 finds that 32x32 is 2.86x more energy-efficient due to better utilization and lower leakage energy. For EDP, 64x64 outperforms both 128x128 and 32x32 for ViT-base. SCALE-Sim v2 shows a 21% reduction in compute cycles for six ResNet18 layers using weight-stationary (WS) dataflow compared to output-stationary (OS). However, when factoring in DRAM stalls, OS dataflow exhibits 30.1% lower execution cycles compared to WS, highlighting the critical role of detailed DRAM analysis.
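The abstract's array-size comparison reduces to a quick energy-delay-product calculation. The normalized values below are inferred from the two quoted ratios (6.53x faster, 2.86x more energy-efficient); the 64x64 configuration is omitted because its absolute numbers are not given in the abstract:

```python
# Normalized figures inferred from the abstract (ViT-base workload):
# 128x128 is 6.53x faster than 32x32, and 32x32 uses 2.86x less energy.
delay  = {"32x32": 6.53, "128x128": 1.0}   # normalized execution time
energy = {"32x32": 1.0,  "128x128": 2.86}  # normalized total energy

# EDP = energy * delay; lower is better.
edp = {cfg: energy[cfg] * delay[cfg] for cfg in delay}
print(edp)  # {'32x32': 6.53, '128x128': 2.86}
```

Neither extreme dominates on both axes, which is why the intermediate 64x64 array can achieve the lowest EDP of the three, per the paper's evaluation.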
Problem

Research questions and friction points this paper is trying to address.

Enhancing systolic accelerator simulation for modern AI workloads
Addressing limitations in sparsity and multi-core scalability support
Improving memory and power efficiency analysis in hardware accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular cycle-accurate simulator for systolic accelerators
Supports sparse matrix multiplication (SpMM) with layer-wise and row-wise sparsity
Integrates Ramulator for detailed DRAM analysis