MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing low-bit block floating-point (BFP) formats such as MXFP4 suffer significant accuracy degradation during large language model (LLM) inference due to intra-block outliers. Method: We propose a non-intrusive extension to the microscaling (MX) formats that repurposes the exponent bits of outlier elements as additional mantissa bits, enhancing their representational precision without modifying hardware or software frameworks and while remaining fully compatible with existing MX formats. Leveraging the structural properties of BFP and MX representations, the approach reuses the element data-type fields to achieve this with near-zero storage and computational overhead. Contribution/Results: Evaluation across multiple LLMs demonstrates consistent and substantial gains over MXFP4 under 4-bit quantization, delivering high compatibility, minimal implementation cost, and superior accuracy for efficient low-bit LLM inference.

📝 Abstract
Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.
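The key insight above is that the block's largest-magnitude element necessarily sits at the block's maximum exponent, so its per-element exponent field carries no information and can be reused as extra mantissa bits. A minimal toy sketch of that idea, assuming FP4 (E2M1) elements with a shared power-of-two block scale; the function name, the scale-selection rule, and the exact outlier encoding are illustrative assumptions, not the paper's precise MX+ format:

```python
import math

# Standard FP4 (E2M1) representable magnitudes used by MXFP4 elements
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4_plus(block):
    """Toy sketch: quantize one block with a shared power-of-two scale.
    Regular elements snap to the FP4 grid; the single largest-magnitude
    element ("outlier") reuses its 2 exponent bits as mantissa bits,
    i.e. 1.mmm * 2^2 -> magnitudes 4.0..7.5 in steps of 0.5."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Choose the scale so the max magnitude lands in [4, 8),
    # near FP4's top binade (max exponent 2, max value 6.0)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out_idx = max(range(len(block)), key=lambda i: abs(block[i]))
    result = []
    for i, v in enumerate(block):
        x = v / scale
        if i == out_idx:
            # Outlier: exponent is implicitly the block max, so all
            # three non-sign bits become mantissa (step 0.5 here)
            mant = round(abs(x) / 0.5) * 0.5
            mant = min(max(mant, 4.0), 7.5)
            q = math.copysign(mant, x)
        else:
            # Regular element: nearest value on the FP4 grid
            g = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
            q = math.copysign(g, x)
        result.append(q * scale)
    return result
```

In this sketch the outlier's quantization step shrinks from FP4's coarse top-binade spacing (4.0 to 6.0) to 0.5, which is where the accuracy gain over plain MXFP4 would come from; the actual MX+ encoding and how decoders detect the outlier element are specified in the paper.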
Problem

Research questions and friction points this paper is trying to address.

Addressing outlier values in ultra low-bit block floating-point formats
Improving precision for efficient large language model serving
Enhancing model performance with minimal storage and speed overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes MX+ extension for microscaling formats
Repurposes exponent field for outlier precision
Enables efficient 4-bit LLM inference with minimal overhead