🤖 AI Summary
Irregular embedding lookups in recommender systems, sparse large language models, and graph learning impose severe performance bottlenecks on conventional hardware. Method: This paper targets the Decoupled Access-Execute (DAE) hardware architecture and introduces the first DAE-aware multi-level intermediate representation (IR) compiler framework, enabling end-to-end automatic optimization. The framework integrates with PyTorch and TensorFlow frontends and jointly optimizes embedding operation scheduling and memory access patterns while preserving semantic correctness. Contribution/Results: It achieves, for the first time, compiler-generated code whose performance is on par with hand-optimized kernels. Evaluated on end-to-end models, the system delivers 2.6× higher throughput and 6.4× better energy efficiency than state-of-the-art GPUs, fully unlocking the potential of the DAE architecture.
📝 Abstract
Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations needed to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
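To make the bottleneck concrete, the sketch below (an illustrative NumPy example, not code from the paper; table sizes and names are hypothetical) shows what an irregular embedding lookup is: a gather of rows from a large table at data-dependent indices, followed by a typical pooling reduction. The random access pattern defeats caches and prefetchers on conventional hardware, which is the memory traffic a DAE design offloads to its access units.

```python
import numpy as np

# Hypothetical illustration: an embedding table and a batch of irregular IDs.
rng = np.random.default_rng(0)
table = rng.standard_normal((100_000, 64)).astype(np.float32)  # embedding table
ids = rng.integers(0, 100_000, size=4096)                      # data-dependent indices

# The "lookup" is a gather: one cache-unfriendly row read per id.
vectors = table[ids]

# Typical downstream reduction (e.g. sum pooling over a bag of ids).
pooled = vectors.sum(axis=0)
print(vectors.shape, pooled.shape)  # (4096, 64) (64,)
```

On a CPU or GPU the gather dominates runtime at scale; the compute (the pooling sum) is trivial by comparison, which is why decoupling access from execute pays off.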