🤖 AI Summary
Recommendation system DNN inference faces severe energy-efficiency bottlenecks. Processing-in-Memory (PIM) architectures offer a promising remedy, but their design space is prohibitively large (over 10⁵⁴ candidate architectures), and operator-to-hardware mapping remains highly complex. This paper proposes an end-to-end automated PIM accelerator design methodology, the first to jointly search recommendation models and mixed-precision PIM architectures. It formulates the co-design as a combinatorial search over mixed-precision interaction operations and parameterizes that search with a one-shot supernet, unifying PIM-aware operator mapping, mixed-precision interaction modeling, and automated hardware generation. Evaluated on click-through rate (CTR) prediction, the approach achieves up to 3.36× inference speedup, 1.68× area reduction, and 12.48× higher energy efficiency than naively mapped searched designs and state-of-the-art handcrafted designs. This work significantly advances the hardware deployment of high-efficiency recommender systems.
📝 Abstract
The performance bottleneck of deep-learning-based recommender systems resides in their backbone Deep Neural Networks. By integrating Processing-In-Memory~(PIM) architectures, researchers can reduce data movement and enhance energy efficiency, paving the way for next-generation recommender models. Nevertheless, achieving performance and efficiency gains is challenging due to the complexity of the PIM design space and the intricate mapping of operators. In this paper, we demonstrate that automated PIM design is feasible even within the most demanding recommender model design space, spanning over $10^{54}$ possible architectures. We propose methodname, which formulates the co-optimization of recommender models and PIM design as a combinatorial search over mixed-precision interaction operations, and parameterizes the search with a one-shot supernet encompassing all mixed-precision options. We comprehensively evaluate our approach on three Click-Through Rate benchmarks, showcasing the superiority of our automated design methodology over manual approaches. Our results indicate up to a 3.36$\times$ speedup, 1.68$\times$ area reduction, and 12.48$\times$ higher power efficiency compared to naively mapped searched designs and state-of-the-art handcrafted designs.
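To make the "one-shot supernet over mixed-precision options" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation): each layer keeps one set of shared full-precision weights, and every forward pass samples a candidate bit-width per layer, so a single trained supernet covers the whole combinatorial space of precision assignments. The candidate bit-widths and fake-quantization scheme below are illustrative assumptions.

```python
import random

# Assumed candidate precisions per layer; the real search space would
# also include PIM hardware parameters, which are omitted here.
CANDIDATE_BITS = [2, 4, 8]

def fake_quantize(w, bits):
    """Uniformly quantize a scalar weight in [-1, 1] to 2^bits - 1 levels."""
    levels = 2 ** bits - 1
    return round((w + 1) / 2 * levels) / levels * 2 - 1

class SupernetLayer:
    """One layer of the supernet: shared weight, precision chosen at call time."""
    def __init__(self, weight):
        self.weight = weight  # full-precision shared parameter

    def forward(self, x, bits):
        return fake_quantize(self.weight, bits) * x

def sample_subnet(layers):
    """Sample one mixed-precision configuration (a 'subnet') uniformly."""
    return [random.choice(CANDIDATE_BITS) for _ in layers]

layers = [SupernetLayer(0.37), SupernetLayer(-0.81)]
config = sample_subnet(layers)   # e.g. [4, 8]
x = 1.0
for layer, bits in zip(layers, config):
    x = layer.forward(x, bits)
# Search space size grows as |CANDIDATE_BITS| ** num_layers,
# which is why the full co-design space exceeds 10^54.
```

During search, many such subnets would be sampled and scored (accuracy plus PIM latency/area/energy estimates) without retraining, since all of them share the same weights.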