MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

📅 2025-08-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Problem: Existing INT4 kernels underutilize NVIDIA Blackwell's FP4 Tensor Cores, and low-precision quantization often incurs significant accuracy degradation. Method: We propose the first hardware-aware mixed-precision quantization framework tailored for FP4 accelerators. Built upon the Microscaling (MX) format, it supports channel-wise dynamic combinations of MXFP4, MXFP6, and MXFP8, and introduces a layer-sensitivity-aware quantization thresholding mechanism for fine-grained precision allocation across weights and activations. Our custom matrix multiplication kernel natively outputs BFloat16 results, fully leveraging FP4 Tensor Cores. Results: Evaluated on Llama and Qwen families, our method achieves >20% higher inference throughput than TensorRT-LLM's FP8 backend, significantly reduces prefill latency, improves memory efficiency, and maintains downstream task accuracy.
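The MX data formats referenced above pack each group of 32 values with a single shared power-of-two scale plus low-precision elements. The following NumPy sketch illustrates that layout under stated assumptions: a block size of 32 and an E8M0 shared scale as in the OCP Microscaling specification, with the element formats approximated by (max value, mantissa bits) pairs for MXFP4 (E2M1), MXFP6 (E3M2), and MXFP8 (E4M3). It is an illustration of the data format only, not the paper's CUDA kernel.

```python
import numpy as np

# Hedged sketch: fake-quantize one Microscaling (MX) block in NumPy.
# Assumptions: 32 elements share one power-of-two (E8M0) scale, and each
# element format is approximated by a (max value, mantissa bits) pair.
ELEM = {"mxfp4": (6.0, 1), "mxfp6": (28.0, 2), "mxfp8": (448.0, 3)}

def mx_fake_quant(block, fmt="mxfp4"):
    """Quantize up to 32 values that share a single power-of-two scale."""
    vmax, mbits = ELEM[fmt]
    block = np.asarray(block, dtype=np.float64)
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    # Shared scale: align the block maximum with the element format's range.
    scale = 2.0 ** (np.floor(np.log2(amax)) - np.floor(np.log2(vmax)))
    s = np.clip(block / scale, -vmax, vmax)
    # Round each element to `mbits` mantissa bits at its own exponent.
    e = np.floor(np.log2(np.maximum(np.abs(s), 2.0 ** -24)))
    step = 2.0 ** (e - mbits)
    return np.round(s / step) * step * scale
```

For example, a block whose largest magnitude is 48 receives a shared scale of 2^3 under MXFP4, after which each element is rounded onto the coarse E2M1 grid.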

📝 Abstract
Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at https://github.com/lwy2020/MicroMix.
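The quantization thresholds described in the abstract can be pictured as a small calibration pass: measure how much error MXFP4 or MXFP6 would introduce on each activation channel, and promote the channel to a wider format when that error exceeds a budget. The sketch below assumes a relative mean-squared-error criterion and hand-picked budgets, both illustrative stand-ins for the paper's layer-sensitivity-aware thresholds, and it reuses the hypothetical mx_fake_quant helper from the earlier sketch.

```python
import numpy as np

def fake_quant_channel(col, fmt, block=32):
    """Apply mx_fake_quant (from the sketch above) to one channel, 32 values at a time."""
    out = np.empty(len(col), dtype=np.float64)
    for i in range(0, len(col), block):
        out[i:i + block] = mx_fake_quant(col[i:i + block], fmt)
    return out

def allocate_channel_formats(acts, budgets=(("mxfp4", 1e-3), ("mxfp6", 1e-4))):
    """acts: [tokens, channels] calibration activations.
    Returns, per channel, the cheapest MX format whose relative MSE fits its budget."""
    formats = []
    for c in range(acts.shape[1]):
        col = acts[:, c].astype(np.float64)
        ref = np.mean(col ** 2) + 1e-12  # normalizer for the relative error
        fmt = "mxfp8"  # fall back to the widest supported element format
        for cand, budget in budgets:
            err = np.mean((col - fake_quant_channel(col, cand)) ** 2) / ref
            if err <= budget:
                fmt = cand
                break
        formats.append(fmt)
    return formats
```

Channels dominated by outliers fail the MXFP4 budget and are promoted to MXFP6 or MXFP8, so most channels stay in the cheapest format while the few error-prone ones keep higher precision, which is the accuracy-efficiency trade-off the paper targets.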
Problem

Research questions and friction points this paper is trying to address.

Bridging INT4 and FP4 format gaps for efficient LLM inference
Optimizing mixed-precision quantization for accuracy and speed
Enhancing GPU performance with tailored MXFP4/6/8 kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision quantization with Microscaling formats
Tailored kernel for Blackwell architecture support
Selective precision allocation for accuracy-efficiency balance
Authors
Wenyuan Liu
College of Intelligence and Computing, Tianjin University, Tianjin, China
Haoqian Meng
College of Intelligence and Computing, Tianjin University, Tianjin, China
Yilun Luo
General Motors
Peng Zhang
College of Intelligence and Computing, Tianjin University, Tianjin, China
Xindian Ma
College of Intelligence and Computing, Tianjin University, Tianjin, China