🤖 AI Summary
In neural audio coding, ensuring consistent perceptual quality across diverse audio content under high compression ratios remains a critical challenge. This paper introduces MUFFIN, a fully convolutional framework featuring the novel psychoacoustically guided Multi-Band Spectral Residual Vector Quantization (MBS-RVQ), which dynamically allocates bitrate according to auditory saliency. MUFFIN decouples speaker identity from linguistic content, enabling seamless integration with large language models. Its architecture combines a transformer-inspired convolutional backbone, an enhanced Snake activation function, and fine-grained multi-band spectral reconstruction. Experiments demonstrate that MUFFIN surpasses state-of-the-art methods across multiple benchmarks; its high-compression variant achieves a remarkably low token rate of 12.5 Hz while maintaining minimal distortion. Moreover, it significantly improves performance on downstream generative tasks, validating its effectiveness and generalizability as a high-fidelity, semantically rich audio token representation.
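For context, the standard Snake activation (which MUFFIN modifies; the paper's exact variant is not specified in this summary) adds a periodic term to the identity, giving the network a built-in bias toward the oscillatory structure of audio waveforms. A minimal NumPy sketch of the baseline formulation:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Standard Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The sin^2 term injects a learnable periodic inductive bias,
    which helps waveform models resolve fine spectral detail.
    `alpha` controls the frequency of the periodic component and
    is typically learned per channel.
    """
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2
```

Note that `snake` is monotonic and passes through the origin, so it behaves like a "wavy identity": for small `alpha` it approaches a plain linear unit, while larger `alpha` emphasizes periodicity.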
📝 Abstract
Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) module that allocates bitrate across frequency bands based on perceptual salience. This design enables efficient compression while disentangling speaker identity from content using distinct codebooks. MUFFIN incorporates a transformer-inspired convolutional backbone and a modified snake activation to enhance resolution in fine-grained spectral regions. Experimental results on multiple benchmarks demonstrate that MUFFIN consistently outperforms existing approaches in reconstruction quality. A high-compression variant achieves a state-of-the-art 12.5 Hz token rate with minimal loss. MUFFIN also proves effective in downstream generative tasks, highlighting its promise as a token representation for integration with language models. Audio samples and code are available.
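The residual vector quantization underlying MBS-RVQ works by having each stage quantize whatever error the previous stages left behind, so later codebooks refine rather than replace earlier ones. A minimal sketch of plain (single-band) RVQ with fixed codebooks; MUFFIN's psychoacoustic band-wise bitrate allocation is omitted here:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization.

    x:          (D,) vector to quantize.
    codebooks:  list of (K, D) arrays, one codebook per stage.
    Returns the per-stage code indices and the final reconstruction.
    """
    codes = []
    quantized = np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = x - quantized  # next stage quantizes what is left
    return codes, quantized
```

In MBS-RVQ, separate quantizers of this form operate per frequency band, with more codebook stages (i.e. more bits) assigned to perceptually salient bands.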