🤖 AI Summary
To address performance bottlenecks in classical quantum circuit simulation on multicore servers, this paper proposes a low-level co-optimization framework for single-node systems. It integrates NUMA-aware memory allocation and thread pinning, AVX-512 vectorization, aggressive loop unrolling, explicit prefetching, and locality-driven restructuring of the computational flow. This work presents the first open-source QuEST extension incorporating a full stack of NUMA-aware optimizations, filling a critical gap in high-performance, reproducible open-source quantum simulators. Experimental evaluation demonstrates speedups of 5.5–6.5× for single-qubit gates, 4.5× for two-qubit gates, 4× for random quantum circuits, and 1.8× for quantum Fourier transforms. These improvements substantially increase both the scale and practicality of quantum circuits simulatable on classical hardware.
📝 Abstract
Scalable classical simulation of quantum circuits is crucial for advancing both quantum algorithm development and hardware validation. In this work, we focus on performance enhancements through meticulous low-level tuning on a single-node system, thereby not only advancing the performance of classical quantum simulations but also laying the groundwork for scalable, heterogeneous implementations that may eventually bridge the gap toward noiseless quantum computing. Although similar low-level tuning efforts have been reported in the literature, such implementations have not been released as open-source software, which impedes independent evaluation and further development. We introduce an open-source, high-performance extension to the QuEST simulator that brings state-of-the-art low-level and NUMA optimizations to modern computers. Our approach emphasizes locality-aware computation and incorporates hardware-specific optimizations such as NUMA-aware memory allocation, thread pinning, AVX-512 vectorization, aggressive loop unrolling, and explicit memory prefetching. Experiments show significant speedups: 5.5–6.5x for single-qubit gate operations, 4.5x for two-qubit gates, 4x for Random Quantum Circuits (RQC), and 1.8x for Quantum Fourier Transform (QFT). These results demonstrate that rigorous performance tuning can substantially extend the practical simulation capacity of classical quantum simulators on current hardware.
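To make the locality argument concrete, the following is a minimal, portable C sketch of the memory-access pattern that a single-qubit gate produces in a state-vector simulator. It is not taken from QuEST or from the paper's extension: the names `amp_t`, `apply_single_qubit_gate`, and `state_norm_sq` are hypothetical, and the NUMA placement, thread pinning, AVX-512, unrolling, and prefetching steps are deliberately left out. It only shows the strided pair updates that such optimizations would target.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical amplitude type for this sketch (not QuEST's own type). */
typedef double complex amp_t;

/*
 * Apply a 2x2 gate u (row-major: u[0] u[1] / u[2] u[3]) to qubit `target`
 * of a state vector holding 2^num_qubits amplitudes.
 *
 * Amplitudes whose indices differ only in bit `target` form a pair and are
 * updated together. The two halves of each block of 2*stride amplitudes are
 * each read contiguously, which is the access pattern that vectorization
 * and prefetching would exploit in a tuned kernel.
 */
static void apply_single_qubit_gate(amp_t *state, unsigned num_qubits,
                                    unsigned target, const amp_t u[4])
{
    assert(target < num_qubits);

    const size_t dim = (size_t)1 << num_qubits;
    const size_t stride = (size_t)1 << target;

    for (size_t base = 0; base < dim; base += 2 * stride) {
        for (size_t k = 0; k < stride; ++k) {
            const size_t i0 = base + k;    /* target bit = 0 */
            const size_t i1 = i0 + stride; /* target bit = 1 */
            const amp_t a0 = state[i0];
            const amp_t a1 = state[i1];
            state[i0] = u[0] * a0 + u[1] * a1;
            state[i1] = u[2] * a0 + u[3] * a1;
        }
    }
}

/* Squared 2-norm of the state; a unitary gate should leave it unchanged. */
static double state_norm_sq(const amp_t *state, unsigned num_qubits)
{
    const size_t dim = (size_t)1 << num_qubits;
    double sum = 0.0;
    for (size_t i = 0; i < dim; ++i) {
        const double re = creal(state[i]);
        const double im = cimag(state[i]);
        sum += re * re + im * im;
    }
    return sum;
}
```

For a low `target`, each pair sits close together in memory; for a high `target`, the two halves of a pair are `2^target` amplitudes apart and may fall in memory attached to different NUMA nodes. That distance is one plausible reason why allocation policy and thread placement matter for this kind of kernel.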