dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Applying a single low-precision floating-point bit-width uniformly across all layers of a large language model gives a sub-optimal trade-off between performance and accuracy. This work proposes dMX, a differentiable mixed-precision quantization framework for the OCP-defined MXFP formats. By folding the discrete per-layer bit-width search into a single continuous learnable offset per layer, and by combining a temperature-annealing schedule with target-aware regularization, dMX avoids abrupt changes in behavior between training and hardware deployment. Experiments on Llama, Qwen3, and SmolLM2 show that dMX consistently outperforms KL-divergence-based layer-selection heuristics on WikiText-2 perplexity and four zero-shot reasoning tasks, yielding Pareto-dominating trade-offs between model quality and average bit-width.

📝 Abstract

Quantizing large language models (LLMs) to low-precision floating-point representations is central to efficient deployment, yet applying a single bit-width uniformly across all layers is sub-optimal in terms of both performance and accuracy. This work introduces dMX, a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment. We study its application to the microscaling floating-point (MXFP) family of data types defined by the Open Compute Project (OCP) standard. The per-layer bit-width assignment is formulated as a continuous optimization problem in which each layer's floating-point format is parameterized by a scalar, folding the multi-variate design space into a single learnable offset. During training this offset takes continuous values, avoiding sudden oscillations between discrete quantization formats. A temperature-based annealing schedule progressively discretizes the learned offsets, ensuring that the final configuration maps to hardware-compatible MXFP formats without abrupt transitions between training and inference behavior. A target-aware regularization term steers the average bit-width toward a user-specified budget, serving as a coarse-grained proxy for inference cost and balancing model quality against deployment efficiency. We performed experiments on several LLM families, including Llama, Qwen3, and SmolLM2, evaluating perplexity on WikiText-2 and accuracy on four zero-shot reasoning benchmarks. Across these settings, dMX consistently yields Pareto-dominating models and improves over Kullback-Leibler (KL) divergence-based layer-selection heuristics, efficiently navigating trade-offs between model quality and average bit-width.
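The abstract does not give the exact parameterization, so the following is only a minimal sketch of how the three ingredients it names (a continuous per-layer offset, temperature annealing, and a budget regularizer) could fit together. Everything in it is an assumption made for illustration: the distance-based softmax, the squared-error penalty, the function names, and the restriction to the 4-, 6-, and 8-bit element widths of the OCP MXFP formats. It is not the authors' formulation.

```python
import numpy as np

# Element bit-widths of the OCP MXFP4 / MXFP6 / MXFP8 formats.
CANDIDATE_BITS = np.array([4.0, 6.0, 8.0])


def soft_assignment(offsets, temperature):
    """Soft weights over candidate formats for each layer.

    `offsets` holds one continuous scalar per layer, read here as a
    fractional index into CANDIDATE_BITS (an assumed parameterization).
    A lower temperature makes the weights closer to one-hot.
    """
    offsets = np.asarray(offsets, dtype=float)
    idx = np.arange(len(CANDIDATE_BITS))
    logits = -np.abs(offsets[..., None] - idx) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def expected_bits(offsets, temperature):
    """Per-layer expected bit-width under the soft assignment."""
    return soft_assignment(offsets, temperature) @ CANDIDATE_BITS


def budget_penalty(offsets, temperature, target_bits):
    """Assumed regularizer: squared gap between average bit-width and budget."""
    return (expected_bits(offsets, temperature).mean() - target_bits) ** 2


def discretize(offsets):
    """Snap each offset to the nearest hardware-supported bit-width."""
    offsets = np.asarray(offsets, dtype=float)
    idx = np.clip(np.rint(offsets), 0, len(CANDIDATE_BITS) - 1).astype(int)
    return CANDIDATE_BITS[idx]


offsets = np.array([0.1, 1.0, 1.9])  # hypothetical learned offsets for three layers
for temperature in (1.0, 0.1, 0.001):  # annealing: high to low temperature
    print(temperature, expected_bits(offsets, temperature))
print(discretize(offsets))
```

In this sketch, the soft bit-widths at high temperature are blends of the candidate formats, which keeps the objective smooth in the offsets. As the temperature shrinks they approach the values returned by `discretize`, so the configuration used at the end of training matches the one that would be deployed.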

Problem

Research questions and friction points this paper is trying to address.

mixed-precision quantization

low-precision floating-point

bit-width assignment

large language models

model compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

differentiable quantization

mixed-precision

MXFP