🤖 AI Summary
Irregular embedding lookups in recommender systems, sparse large language models, and graph learning impose severe performance bottlenecks on conventional hardware. Method: This paper targets the Decoupled Access-Execute (DAE) hardware architecture and introduces the first DAE-aware multi-level intermediate representation (IR) compiler framework, enabling end-to-end automatic optimization. The framework integrates with PyTorch and TensorFlow frontends and jointly optimizes embedding operation scheduling and memory access patterns while preserving semantic correctness. Contribution/Results: It achieves, for the first time, compiler-generated code whose performance is on par with hand-optimized kernels. Evaluated on end-to-end models, the system delivers 2.6× higher throughput and 6.4× better energy efficiency than state-of-the-art GPUs, fully unlocking the potential of the DAE architecture.
📝 Abstract
Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations needed to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.
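To make the bottleneck concrete, the sketch below (an illustrative NumPy example, not code from the paper; table sizes and names are hypothetical) shows what an irregular embedding lookup is: a gather of rows from a large table at data-dependent indices, followed by a typical pooling reduction. The random access pattern defeats caches and prefetchers on conventional hardware, which is the memory traffic a DAE design offloads to its access units.

```python
import numpy as np

# Hypothetical illustration: an embedding table and a batch of irregular IDs.
rng = np.random.default_rng(0)
table = rng.standard_normal((100_000, 64)).astype(np.float32)  # embedding table
ids = rng.integers(0, 100_000, size=4096)                      # data-dependent indices

# The "lookup" is a gather: one cache-unfriendly row read per id.
vectors = table[ids]

# Typical downstream reduction (e.g. sum pooling over a bag of ids).
pooled = vectors.sum(axis=0)
print(vectors.shape, pooled.shape)  # (4096, 64) (64,)
```

On a CPU or GPU the gather dominates runtime at scale; the compute (the pooling sum) is trivial by comparison, which is why decoupling access from execute pays off.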