🤖 AI Summary
Existing floating-point GEMM accelerators struggle to balance operating frequency, computational unit utilization, and buffering overhead. This work proposes O-POPE, a scalable output-stationary outer-product engine that repurposes the pipeline registers of floating-point units (FPUs) as data buffers. Doing so drastically reduces additional storage overhead while enabling high-frequency, high-utilization computation. A 2048-MAC implementation in 12 nm FinFET technology spends less than 2% of its area on buffering, runs at 1 GHz, and achieves up to 99.97% FPU utilization. Compared to state-of-the-art designs, O-POPE delivers 1.33× higher performance, along with 9% higher performance density and 8% better energy efficiency.
📝 Abstract
General matrix multiply (GEMM) dominates both the execution time and the energy consumption of modern machine learning (ML) workloads, placing increasing pressure on hardware efficiency. While quantization mitigates computational and data movement costs, accuracy-sensitive tasks such as training still require higher-precision floating-point formats. Existing floating-point GEMM accelerators face trade-offs between operating frequency, arithmetic utilization, and buffering overhead. This work presents O-POPE, a scalable outer-product engine that simultaneously achieves high utilization, low overhead, and a fast operating frequency by repurposing floating-point unit (FPU) pipeline registers as buffers. This solution leverages the data-reuse advantages of output-stationary outer-product execution and enables 1 GHz (0.72 V) operation in 12 nm FinFET technology with less than 2% buffer area for a 2048-MAC configuration. Our evaluation shows that O-POPE achieves up to 99.97% FPU utilization and improves performance by 1.33×, performance density by 9%, and energy efficiency by 8% compared to state-of-the-art floating-point GEMM accelerators.
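As a rough software illustration of the output-stationary outer-product dataflow mentioned above, the sketch below computes C = A·B as a sum of rank-1 updates while each accumulator C[i][j] stays in place. It is a minimal functional model only, not the O-POPE microarchitecture: it does not model FPU pipelining or register reuse, and the names `outer_product_gemm` and `macs_per_operand` are hypothetical helpers introduced here for illustration.

```python
def outer_product_gemm(a, b):
    """Compute C = A @ B as K rank-1 (outer-product) updates.

    Output-stationary: every C[i][j] accumulator stays put while
    one column of A and one row of B stream in per step.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]  # stationary accumulators
    for p in range(k):
        col = [a[i][p] for i in range(m)]  # m operands from A
        row = b[p]                         # n operands from B
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]  # one MAC per output element
    return c


def macs_per_operand(m, n):
    """Data reuse per step: m*n MACs are fed by only m+n streamed operands."""
    return (m * n) / (m + n)


if __name__ == "__main__":
    print(outer_product_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(macs_per_operand(4, 4))
```

Each outer-product step performs m·n multiply-accumulates from only m + n streamed operands, which is the data-reuse property that output-stationary outer-product execution exploits; the tile shapes used here are arbitrary and are not taken from the paper.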