Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deploying neural networks on resource-constrained devices faces a fundamental tension between growing model size and limited storage capacity. This paper proposes a post-training compression framework that jointly optimizes the rate-distortion trade-off: it extends the standard layer-wise loss with a quadratic rate estimate and derives locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. The approach supports arbitrary quantization grids, allows very fast decoding, and combines rate-aware quantization with entropy coding. Evaluated on several computer-vision models, it achieves a 20-40% bit-rate reduction over NNCodec at the same model performance. The implementation is publicly available.

📝 Abstract
The ever-growing size of neural networks poses serious challenges on resource-constrained devices, such as embedded sensors. Compression algorithms that reduce their size can mitigate these problems, provided that model performance stays close to the original. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding by (1) extending the well-known layer-wise loss by a quadratic rate estimation, and (2) providing locally exact solutions to this modified objective following the Optimal Brain Surgeon (OBS) method. Our method allows for very fast decoding and is compatible with arbitrary quantization grids. We verify our results empirically by testing on various computer-vision networks, achieving a 20-40% decrease in bit rate at the same performance as the popular compression algorithm NNCodec. Our code is available at https://github.com/Conzel/cerwu.
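The abstract's key idea can be written out as a small derivation. The notation below is our own assumption (the paper's exact symbols may differ): a weight matrix $W$, calibration inputs $X$, a quantized matrix $\hat W$, a rate estimate $\hat R$, and a trade-off parameter $\lambda$.

```latex
% Standard layer-wise reconstruction loss used by OBS-style methods:
\mathcal{L}(\hat W) \;=\; \lVert W X - \hat W X \rVert_F^2

% Rate-extended objective described in the abstract: the same loss
% plus a quadratic estimate \hat R of the entropy-coded bit rate,
% weighted by a rate-distortion trade-off parameter \lambda:
\mathcal{L}_\lambda(\hat W) \;=\; \lVert W X - \hat W X \rVert_F^2
  \;+\; \lambda\, \hat R(\hat W)
```

Larger $\lambda$ favors a smaller compressed size at the cost of reconstruction fidelity; the paper's contribution is solving this objective locally exactly, per layer, in the OBS framework.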
Problem

Research questions and friction points this paper is trying to address.

Reducing storage for pretrained neural networks
Compressing models without losing performance
Fast decoding with arbitrary quantization grids
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines rate-aware quantization with entropy coding
Extends layer-wise loss with quadratic rate estimation
Provides locally exact solutions via OBS method
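To make the "rate-aware quantization" idea above concrete, here is a toy sketch (not the paper's CERWU algorithm, and without the OBS machinery): each weight is mapped to the grid point minimizing squared distortion plus λ times an estimated code length, where the code length comes from a running empirical distribution over grid symbols, a crude stand-in for entropy-coded rate. All names and the smoothing scheme are our illustrative assumptions.

```python
import math
from collections import Counter

def quantize_rate_aware(weights, grid, lam=0.1):
    """Toy rate-distortion quantization (illustrative, not the paper's method).

    Each weight w is mapped to the grid point q minimising
        (w - q)^2 + lam * bits(q),
    where bits(q) = -log2 p(q) under a running, Laplace-smoothed
    empirical distribution over grid symbols.
    """
    counts = Counter({q: 1 for q in grid})  # one pseudo-count per symbol
    total = len(grid)
    out = []
    for w in weights:
        best_q, best_cost = None, float("inf")
        for q in grid:
            bits = -math.log2(counts[q] / total)  # estimated code length
            cost = (w - q) ** 2 + lam * bits
            if cost < best_cost:
                best_q, best_cost = q, cost
        counts[best_q] += 1  # update the symbol statistics
        total += 1
        out.append(best_q)
    return out
```

With `lam=0` this reduces to nearest-grid rounding; with a large `lam` the quantizer collapses weights onto frequently used symbols, shrinking the entropy of the symbol stream (and hence the entropy-coded size) at the cost of distortion, which is exactly the trade-off the paper's λ controls.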