🤖 AI Summary
This work addresses the challenge of ultra-low-bitrate audio coding for machine perception. We propose a task-oriented residual vector quantization (RVQ) method to compress and quantize intermediate feature representations from pretrained speech/audio models. Unlike conventional paradigms optimized for perceptual fidelity, our approach directly incorporates downstream-task-specific losses—such as ASR word error rate or audio classification accuracy—as optimization objectives for quantization, enabling joint bitrate–performance optimization. The framework supports multi-bitrate adaptation and transferability across model scales. Evaluated on automatic speech recognition and audio classification tasks, it achieves bitrates below 200 bps while retaining over 99% of the original model’s performance—significantly outperforming state-of-the-art neural audio codecs.
📝 Abstract
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with minimal loss of downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
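To make the core mechanism concrete, here is a minimal sketch of residual vector quantization at inference time: each stage quantizes the residual left by the previous stage, and the per-frame bit cost is the sum of log2(codebook size) over stages. This is an illustrative toy, not the paper's implementation; the function name `rvq_encode`, the codebook shapes, and the use of fixed numpy codebooks are assumptions, and the task-specific loss guidance described in the abstract happens at training time and is not shown here.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization of a single feature vector.

    x: (D,) feature vector to quantize.
    codebooks: list of (K_i, D) arrays, one per RVQ stage.
    Returns (indices, quantized), where `quantized` is the sum of the
    selected codewords and `indices` are the transmitted tokens.
    """
    residual = x.astype(float).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]  # the next stage refines what remains
    return indices, quantized

# Toy example: 2 stages, 8 codewords each -> 3 + 3 = 6 bits per frame.
# At a 50 Hz frame rate this would be 300 bps; fewer/smaller codebooks
# push the rate below 200 bps, as targeted in the paper.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((8, 4)) for _ in range(2)]
x = rng.standard_normal(4)
indices, q = rvq_encode(x, codebooks)
bits_per_frame = sum(np.log2(len(cb)) for cb in codebooks)
```

During training, the paper's approach would backpropagate a downstream task loss (e.g., ASR loss) together with the usual RVQ commitment/codebook losses through this bottleneck, so the codebooks preserve task-relevant rather than perceptually relevant information.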