🤖 AI Summary
To address the low inference efficiency of ultra-low-bit (1/1.58/2-bit) large language models on resource-constrained platforms such as AI PCs and edge devices, where accuracy and speed are hard to balance, this paper proposes a hardware-software co-designed optimization approach tailored to modern CPUs. We design and implement highly optimized 1-bit and 2-bit specialized microkernels and tightly integrate them into the PyTorch-TPP framework via quantization-aware compilation and runtime scheduling. Our core contribution is moving beyond generic operator implementations to hardware-level specialization for ultra-low-bit tensor computation. Experiments demonstrate that our 2-bit implementation achieves up to 2.2× higher end-to-end inference throughput than the state-of-the-art bitnet.cpp, and up to 7× speedup over the 16-bit baseline, while preserving competitive language-modeling performance.
📝 Abstract
The advent of ultra-low-bit LLMs (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts at the same model size, is ushering in a new era of LLM inference for resource-constrained environments such as edge devices and AI PCs. While these quantization advances promise models that are more cost-effective in terms of latency, memory, throughput, and energy consumption, the computational efficiency of state-of-the-art (SOTA) inference runtimes (e.g., bitnet.cpp) used to deploy them remains underexplored. In this work, we take a bottom-up approach: we first design and implement 1-bit and 2-bit microkernels optimized for modern CPUs, achieving peak computational efficiency across a variety of CPU platforms. We integrate these microkernels into a state-of-the-art LLM inference framework, namely PyTorch-TPP, and present end-to-end inference results with 2-bit models that outperform the current SOTA runtime bitnet.cpp by up to 2.2×, and deliver up to 7× speedup compared to 16-bit model inference. Our optimized runtime advances the state of LLM inference on AI PCs and edge devices, paving the way for efficient deployment of ultra-low-bit LLMs.