SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

📅 2025-10-10

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

To address insufficient model compression for neural networks on resource-constrained devices, this paper proposes a Bayesian variational learning framework that unifies pruning and low-bit quantization for the first time. Methodologically, it innovatively combines spike-and-slab priors with Gaussian mixture models (GMMs), theoretically proving the consistency of joint sparsification–quantization optimization, and achieves end-to-end co-learning of structural sparsity and low-bit weights via variational inference. Experiments on ResNet, BERT-base, Llama3, and Qwen2.5 demonstrate that, under comparable accuracy degradation, the proposed framework achieves significantly higher compression ratios than state-of-the-art methods—validating its efficacy in delivering high compression efficiency without compromising model performance.

Technology Category

Application Category

📝 Abstract

Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to inducing sparsity and model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we provide the consistent result of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.

Problem

Research questions and friction points this paper is trying to address.

Compressing large neural networks for resource-limited devices

Unifying pruning and quantization via Bayesian variational learning

Achieving higher compression rates while maintaining model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian variational learning for unified pruning and quantization

Spike-and-slab prior induces sparsity in neural networks

Gaussian Mixture Models enable low-bit weight quantization

🔎 Similar Papers

Effective Interplay between Sparsity and Quantization: From Theory to Practice