🤖 AI Summary
This study systematically evaluates the feasibility of replacing NVIDIA A100 GPUs with Intel Gaudi-2 NPUs for AI inference serving. Method: We design a microbenchmarking framework assessing four critical dimensions—compute performance, memory bandwidth, inter-device communication efficiency, and energy efficiency—and conduct end-to-end AI workload comparisons. We further perform deep software co-optimization, including FBGEMM operator customization and vLLM inference engine adaptation tailored to Gaudi-2’s NPU architecture. Contribution/Results: To our knowledge, this is the first real-world inference evaluation demonstrating that Gaudi-2 achieves A100-level throughput and energy efficiency across mainstream LLMs—reaching up to 92% of A100’s throughput while delivering 1.3× higher performance-per-watt. We propose NPU-aware operator fusion and scheduling optimizations that substantially alleviate software-stack immaturity bottlenecks. Our findings indicate that Gaudi-2 possesses tangible technical potential to challenge GPU dominance in inference, contingent upon framework-level, hardware-software co-design.
📝 Abstract
With the rise of AI, NVIDIA GPUs have become the de facto standard for AI system design. This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs for AI model serving. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves competitive performance not only in primitive AI compute, memory, and communication operations but also in executing several important AI workloads end-to-end. We then assess Gaudi NPU's programmability by discussing several software-level optimization strategies to employ for implementing critical FBGEMM operators and vLLM, evaluating their efficiency against GPU-optimized counterparts. Results indicate that Gaudi-2 achieves energy efficiency comparable to A100, though there are notable areas for improvement in terms of software maturity. Overall, we conclude that, with effective integration into high-level AI frameworks, Gaudi NPUs could challenge NVIDIA GPU's dominance in the AI server market, though further improvements are necessary to fully compete with NVIDIA's robust software ecosystem.