🤖 AI Summary
This work addresses three key challenges in quantization evaluation for neural codecs: high training overhead, unreliable gradient approximation, and the absence of an efficient evaluation paradigm. We propose a lightweight, simulation-driven framework for assessing quantization effects. The method uses low-complexity encoders and decoders, together with synthetically generated data of a defined bit count, to emulate the nonlinear behavior of larger networks. It covers three gradient-passing techniques: statistical quantizer emulation, soft-to-hard annealing, and the straight-through estimator (STE), and it introduces a modification to the STE that stabilizes training. Compared to full-scale training, the approach substantially reduces evaluation cost in training time, computation, and hardware requirements. It also exposes how different gradient approximation strategies affect the behavior of the overall system. The findings are validated on an internal neural audio codec and on the Descript Audio Codec.
📝 Abstract
Neural codecs, comprising an encoder, a quantizer, and a decoder, enable signal transmission at exceptionally low bitrates. Training these systems requires techniques such as the straight-through estimator, soft-to-hard annealing, or statistical quantizer emulation to allow a non-zero gradient across the quantizer. Evaluating the effect of quantization in neural codecs, such as the influence of gradient-passing techniques on the whole system, is often costly and time-consuming because of training demands and the lack of affordable and reliable metrics. This paper proposes an efficient evaluation framework for neural codecs that uses simulated data with a defined number of bits and low-complexity neural encoders and decoders to emulate the non-linear behavior of larger networks. The framework is highly efficient in terms of training time, computation, and hardware requirements, which allows us to uncover distinct behaviors in neural codecs. Based on our findings, we propose a modification that stabilizes training with the straight-through estimator. We validate our findings against an internal neural audio codec and against the state-of-the-art Descript Audio Codec.
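The three gradient-passing techniques named in the abstract can be illustrated on a scalar uniform quantizer. The following is a minimal sketch for orientation only, not the paper's implementation: the function names, the uniform step size, the softmax-over-squared-distance form of the soft quantizer, and the uniform noise model are all assumptions chosen for illustration.

```python
import math
import random


def hard_quantize(x, step=1.0):
    """Uniform scalar quantizer: round x to the nearest multiple of step.

    Its true derivative is zero almost everywhere, which is why a
    surrogate is needed during training.
    """
    return step * math.floor(x / step + 0.5)


def ste_quantize(x, step=1.0):
    """Straight-through estimator (illustrative).

    Forward pass: the hard-quantized value.
    Backward pass: the quantizer is treated as the identity, so the
    surrogate gradient dy/dx is 1. Returns (value, surrogate_gradient).
    """
    return hard_quantize(x, step), 1.0


def soft_quantize(x, levels, temperature):
    """Soft-to-hard annealing (illustrative).

    Softmax-weighted average of the reconstruction levels. As the
    temperature is annealed toward 0, the output approaches the
    nearest level, i.e. hard quantization.
    """
    logits = [-((x - c) ** 2) / temperature for c in levels]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, levels)) / total


def noise_quantize(x, step=1.0, rng=random):
    """Statistical quantizer emulation (illustrative).

    Replaces rounding with additive uniform noise of the same width
    as one quantization cell, which keeps the mapping differentiable.
    """
    return x + rng.uniform(-step / 2, step / 2)


if __name__ == "__main__":
    x = 0.4
    print("hard:", hard_quantize(x))
    print("STE (value, grad):", ste_quantize(x))
    for t in (1e3, 1.0, 1e-3):  # annealing schedule, high to low
        print(f"soft, T={t}:", soft_quantize(x, [0.0, 1.0], t))
    print("noise:", noise_quantize(x, rng=random.Random(0)))
```

In this sketch all three surrogates agree with `hard_quantize` in the limit (exactly for the STE forward pass, as the temperature goes to zero for the soft quantizer, and on average for the noise model), but they pass different gradients to the encoder, which is the kind of difference the evaluation framework is meant to expose.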