Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets a limitation of existing low-bitrate neural speech codecs: representation learning and probability modeling are typically decoupled, so the codec cannot fully exploit the non-uniform usage and temporal dependencies of its latent variables, which limits compression efficiency. Drawing on rate–distortion theory, the authors first benchmark recent neural speech codecs and find that explicit probability modeling remains underexplored. They then propose the Entropy-Constrained Codec (ECC), which combines scalar quantization with a learned entropy model. ECC integrates a hyperprior, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling within an end-to-end entropy-constrained training paradigm. Its entropy skip mechanism omits highly predictable residual symbols using scale estimates available at the decoder, so no skip masks need to be transmitted. On two widely used test sets, ECC reduces BD-rate by 39.9% (ViSQOL) and 76.3% (PESQ) on average relative to conventional and neural codec baselines, supporting the case for jointly optimizing quantization and probability modeling.
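For readers unfamiliar with the metric, the sketch below shows one common way a BD-rate figure like those quoted above can be computed: fit a cubic polynomial to log-bitrate as a function of quality for each codec, integrate both fits over the shared quality range, and convert the mean log-rate gap into a percentage. This is the generic Bjøntegaard procedure, not necessarily the exact evaluation script used by the authors; the function name `bd_rate` and the sample numbers are illustrative assumptions, not data from the paper.

```python
import numpy as np

def bd_rate(rate_ref, quality_ref, rate_test, quality_test):
    """Average bitrate change (%) of the test codec vs. the reference at equal quality.

    Generic Bjontegaard-delta sketch (assumed procedure, not the paper's script).
    Negative values mean the test codec needs fewer bits for the same quality.
    """
    q_ref = np.asarray(quality_ref, dtype=float)
    q_test = np.asarray(quality_test, dtype=float)
    log_ref = np.log(np.asarray(rate_ref, dtype=float))
    log_test = np.log(np.asarray(rate_test, dtype=float))

    # Cubic fit of log-rate as a function of quality for each codec.
    p_ref = np.polyfit(q_ref, log_ref, 3)
    p_test = np.polyfit(q_test, log_test, 3)

    # Integrate only over the quality interval covered by both curves.
    lo = max(q_ref.min(), q_test.min())
    hi = min(q_ref.max(), q_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    # Mean log-rate difference, converted back to a relative rate change.
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Illustrative numbers only: the test codec uses half the bitrate at every quality point.
quality = [2.0, 2.5, 3.0, 3.5]          # e.g. a MOS-like quality score
rate_ref = [1.5, 3.0, 6.0, 12.0]        # kbps, hypothetical reference codec
rate_test = [0.75, 1.5, 3.0, 6.0]       # kbps, hypothetical test codec
print(bd_rate(rate_ref, quality, rate_test, quality))  # close to -50
```

Under this convention, a BD-rate reduction of 39.9% corresponds to a return value of about -39.9: the test codec needs that much less bitrate, on average, to reach the same quality score.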

📝 Abstract

Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate–distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate–distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: https://avery-xu.github.io/ECC-demo/
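To make the entropy skip idea in the abstract concrete, here is a minimal sketch under stated assumptions. It assumes a discretized Gaussian entropy model over integer residuals, with a predicted mean and scale per latent element, and a fixed scale threshold below which a residual is not coded. The abstract does not give these details, so the Gaussian form, the threshold value, and the names `estimate_bits` and `entropy_skip` are all hypothetical. The point it illustrates is that the skip decision depends only on the scale, which the decoder also has, so no mask has to be sent.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gaussian_cdf(x, scale):
    """CDF of a zero-mean Gaussian with the given scale."""
    return 0.5 * (1.0 + _erf(x / (scale * math.sqrt(2.0))))

def estimate_bits(residual, scale):
    """Estimated bits per integer residual under a discretized Gaussian (assumed model)."""
    likelihood = gaussian_cdf(residual + 0.5, scale) - gaussian_cdf(residual - 0.5, scale)
    return -np.log2(np.clip(likelihood, 1e-9, 1.0))

def entropy_skip(latent, mean, scale, threshold=0.1):
    """Quantize residuals, skipping elements whose predicted scale is small.

    `threshold` is a hypothetical hyperparameter. The skip mask is derived from
    `scale` alone, so a decoder holding the same scale can rebuild it.
    """
    skip = scale < threshold
    residual = np.round(latent - mean)        # scalar quantization of the residual
    residual = np.where(skip, 0.0, residual)  # skipped symbols fall back to the prediction
    bits = np.where(skip, 0.0, estimate_bits(residual, scale))
    reconstruction = mean + residual
    return reconstruction, bits, skip

# Illustrative values only (not from the paper).
latent = np.array([0.02, 1.7, -2.3])
mean = np.array([0.0, 0.5, 0.0])
scale = np.array([0.05, 1.0, 2.0])
reconstruction, bits, skip = entropy_skip(latent, mean, scale)
print(skip, reconstruction, bits.sum())
```

During training, a rate term such as `bits.sum()` would typically be weighted against a distortion term to form the rate–distortion objective; at inference time the same likelihoods would drive an arithmetic coder for the non-skipped symbols.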

Problem

Research questions and friction points this paper is trying to address.

neural speech compression

rate-distortion

entropy modeling

low-bitrate coding

learned speech codecs

Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-constrained coding

neural speech compression

rate-distortion optimization