SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of compressing neural networks for deployment on resource-constrained devices, this paper proposes a Bayesian variational learning framework that, for the first time, unifies pruning and low-bit quantization. Methodologically, it combines spike-and-slab priors with Gaussian mixture models (GMMs), proves the consistency of the joint sparsification–quantization optimization, and achieves end-to-end co-learning of structural sparsity and low-bit weights via variational inference. Experiments on ResNet, BERT-base, Llama3, and Qwen2.5 demonstrate that, at comparable accuracy drops, the framework achieves significantly higher compression ratios than state-of-the-art methods, validating its ability to deliver high compression efficiency without compromising model performance.

📝 Abstract
Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often yielding suboptimal compression rates in order to keep performance drops acceptable. We introduce SQS, a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning, which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights with Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide a consistency result showing that our variational approach recovers a sparse, quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a range of existing methods with comparable performance drops.
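The spike-and-slab idea in the abstract can be sketched numerically. In the sketch below (an illustration of the general technique, not the paper's exact algorithm; the function name, threshold, and toy values are assumptions), each weight has a variational posterior that mixes a point mass at zero (the "spike") with a Gaussian (the "slab"); weights whose inclusion probability is low are pruned to exactly zero, and the rest take their posterior-mean value:

```python
import numpy as np

def spike_slab_posterior_mean(mu, pi, threshold=0.5):
    """Posterior-mean weights under q(w_i) = (1 - pi_i) * delta_0 + pi_i * N(mu_i, sigma_i^2).

    mu  : slab means of the variational posterior
    pi  : inclusion probabilities (probability the weight is nonzero)
    Weights with pi_i below `threshold` are hard-pruned to zero.
    """
    mean = pi * mu                 # E[w_i] = pi_i * mu_i under the mixture posterior
    mean[pi < threshold] = 0.0     # prune weights where the spike dominates
    return mean

# Toy example: four weights, one with low inclusion probability.
mu = np.array([0.8, -1.2, 0.05, 2.0])
pi = np.array([0.9, 0.2, 0.95, 0.7])
w = spike_slab_posterior_mean(mu, pi)
# -> [0.72, 0.0, 0.0475, 1.4] (second weight pruned)
```

In practice the inclusion probabilities and slab parameters would be learned jointly by maximizing a variational objective; this sketch only shows how the posterior induces exact zeros.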
Problem

Research questions and friction points this paper is trying to address.

Compressing large neural networks for resource-limited devices
Unifying pruning and quantization via Bayesian variational learning
Achieving higher compression rates while maintaining model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian variational learning for unified pruning and quantization
Spike-and-slab prior induces sparsity in neural networks
Gaussian Mixture Models enable low-bit weight quantization
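The GMM quantization bullet can likewise be illustrated with a minimal sketch (again an assumption-laden toy, not the paper's implementation): the GMM component means act as a codebook with 2^b entries for b-bit precision, and each surviving weight is mapped to its nearest component mean:

```python
import numpy as np

def gmm_quantize(weights, centers):
    """Hard-assign each weight to the nearest GMM component mean.

    centers : component means, length 2**b for b-bit quantization
    Returns the quantized weights and the integer codebook indices.
    """
    # Broadcast |w_i - c_k| over all (weight, center) pairs, pick the nearest center.
    idx = np.argmin(np.abs(weights[:, None] - centers[None, :]), axis=1)
    return centers[idx], idx

# Toy 2-bit codebook (2**2 = 4 levels) and four weights.
centers = np.array([-1.0, -0.25, 0.25, 1.0])
w = np.array([0.9, -0.3, 0.1, -1.1])
q, idx = gmm_quantize(w, centers)
# q   -> [1.0, -0.25, 0.25, -1.0]
# idx -> [3, 1, 2, 0]  (2-bit codes, storable in 2 bits each)
```

In the actual framework the component means and assignments would be learned through the variational objective rather than fixed; the point of the sketch is that storing `idx` plus the small codebook replaces storing full-precision weights.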