🤖 AI Summary
This work addresses the challenge of ultra-low-bitrate audio coding for machine perception. We propose a task-oriented residual vector quantization (RVQ) method to compress and quantize intermediate feature representations from pretrained speech/audio models. Unlike conventional paradigms optimized for perceptual fidelity, our approach directly incorporates downstream-task-specific losses—such as ASR word error rate or audio classification accuracy—as optimization objectives for quantization, enabling joint bitrate–performance optimization. The framework supports multi-bitrate adaptation and transferability across model scales. Evaluated on automatic speech recognition and audio classification tasks, it achieves bitrates below 200 bps while retaining over 99% of the original model’s performance—significantly outperforming state-of-the-art neural audio codecs.
📝 Abstract
Neural audio codecs, leveraging quantization algorithms, have significantly impacted various speech/audio tasks. While high-fidelity reconstruction is paramount for human perception, audio coding for machines (ACoM) prioritizes efficient compression and downstream task performance, disregarding perceptual nuances. This work introduces an efficient ACoM method that can compress and quantize any chosen intermediate feature representation of an already trained speech/audio downstream model. Our approach employs task-specific loss guidance alongside residual vector quantization (RVQ) losses, providing ultra-low bitrates (i.e., less than 200 bps) with minimal loss of downstream model performance. The resulting tokenizer is adaptable to various bitrates and model sizes for flexible deployment. Evaluated on automatic speech recognition and audio classification, our method demonstrates its efficacy and potential for broader task and architectural applicability through appropriate regularization.
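To make the core mechanism concrete, here is a minimal sketch of residual vector quantization at inference time: each stage quantizes the residual left by the previous stage, and the per-frame bit cost is the sum of log2(codebook size) over stages. This is an illustrative toy, not the paper's implementation; the function name `rvq_encode`, the codebook shapes, and the use of fixed numpy codebooks are assumptions, and the task-specific loss guidance described in the abstract happens at training time and is not shown here.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization of a single feature vector.

    x: (D,) feature vector to quantize.
    codebooks: list of (K_i, D) arrays, one per RVQ stage.
    Returns (indices, quantized), where `quantized` is the sum of the
    selected codewords and `indices` are the transmitted tokens.
    """
    residual = x.astype(float).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]  # the next stage refines what remains
    return indices, quantized

# Toy example: 2 stages, 8 codewords each -> 3 + 3 = 6 bits per frame.
# At a 50 Hz frame rate this would be 300 bps; fewer/smaller codebooks
# push the rate below 200 bps, as targeted in the paper.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((8, 4)) for _ in range(2)]
x = rng.standard_normal(4)
indices, q = rvq_encode(x, codebooks)
bits_per_frame = sum(np.log2(len(cb)) for cb in codebooks)
```

During training, the paper's approach would backpropagate a downstream task loss (e.g., ASR loss) together with the usual RVQ commitment/codebook losses through this bottleneck, so the codebooks preserve task-relevant rather than perceptually relevant information.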