Selective Quantization Tuning for ONNX Models

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant accuracy degradation and deployment challenges on resource-constrained hardware caused by full quantization, this paper proposes selective quantization: quantizing only a subset of layers in a deep neural network to trade off accuracy, model size, and computational cost. To this end, the authors design and implement TuneQn, a toolkit supporting selective quantization of ONNX models, cross-platform (CPU/GPU) performance profiling, Pareto-based multi-objective optimization, and interactive visualization. Experiments show that, compared to fully quantized baselines, the method reduces accuracy loss by up to 54.14%; relative to the original floating-point model, it achieves up to a 72.9% reduction in model size. These improvements substantially enhance deployability and energy efficiency on low-end accelerators.

📝 Abstract
Quantization is a process that reduces the precision of deep neural network models to lower model size and computational demands, often at the cost of accuracy. However, fully quantized models may exhibit sub-optimal performance below acceptable levels and face deployment challenges on low-end hardware accelerators due to practical constraints. To address these issues, quantization can be selectively applied to only a subset of layers, but selecting which layers to exclude is non-trivial. To this end, we propose TuneQn, a suite enabling selective quantization, deployment and execution of ONNX models across various CPU and GPU devices, combined with profiling and multi-objective optimization. TuneQn generates selectively quantized ONNX models, deploys them on different hardware, measures performance on metrics like accuracy and size, performs Pareto Front minimization to identify the best model candidate, and visualizes the results. To demonstrate the effectiveness of TuneQn, we evaluated it on four ONNX models with two quantization settings across CPU and GPU devices. We demonstrated that our utility effectively performs selective quantization and tuning, selecting ONNX model candidates with up to a 54.14% reduction in accuracy loss compared to the fully quantized model, and up to a 72.9% model size reduction compared to the original model.
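The Pareto Front minimization mentioned in the abstract can be illustrated with a small sketch: given candidate models scored on objectives to minimize (here, accuracy loss and model size), keep only candidates not dominated by any other. The candidate names and numbers below are hypothetical, not results from the paper.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on any objective (lower is better).

    A candidate is dominated if some other candidate is at least as good on
    every objective and differs on at least one.
    """
    front = []
    for name, objs in candidates.items():
        dominated = any(
            other != objs and all(o2 <= o1 for o1, o2 in zip(objs, other))
            for other in candidates.values()
        )
        if not dominated:
            front.append(name)
    return front

# Illustrative (accuracy loss %, size MB) pairs, not measured data.
candidates = {
    "fp32":        (0.0, 98.0),   # original floating-point model
    "full_int8":   (4.8, 25.1),   # fully quantized
    "skip_layer3": (1.2, 27.4),   # one layer excluded from quantization
    "skip_l3_l7":  (1.2, 30.0),   # same loss but larger, so dominated
}
print(pareto_front(candidates))  # ['fp32', 'full_int8', 'skip_layer3']
```

The surviving set is the accuracy/size trade-off curve from which a final model candidate would be picked.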
Problem

Research questions and friction points this paper is trying to address.

Selective quantization for ONNX models to balance accuracy and size
Optimizing layer selection for quantization to maintain performance
Deploying quantized models efficiently on diverse hardware platforms
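Why layer selection is a friction point becomes clear from the size of the search space: the number of possible exclusion sets grows combinatorially with the number of layers. A minimal stdlib sketch, with hypothetical layer names:

```python
from itertools import combinations

# Hypothetical layer names; a real ONNX model would list its node names.
layers = ["conv1", "conv2", "fc1", "fc2"]

def exclusion_configs(layers, max_excluded):
    """Yield every way to exclude up to max_excluded layers from quantization."""
    for k in range(max_excluded + 1):
        yield from combinations(layers, k)

configs = list(exclusion_configs(layers, 2))
print(len(configs))  # 1 + 4 + 6 = 11 configurations for just 4 layers
```

Even capping exclusions at two layers yields 11 configurations for a four-layer toy model; realistic networks with dozens of layers make exhaustive evaluation expensive, motivating tool support for profiling and pruning the space.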
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective quantization for ONNX models
Multi-objective optimization for model tuning
Hardware-aware profiling and deployment
Nikolaos Louloudakis
University of Edinburgh, United Kingdom
Ajitha Rajan
University of Edinburgh
Software Engineering