Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In neural audio coding, ensuring consistent perceptual quality across diverse audio content under high compression ratios remains a critical challenge. This paper introduces MUFFIN, a fully convolutional framework featuring the novel psychoacoustically guided Multi-Band Spectral Residual Vector Quantization (MBS-RVQ), which dynamically allocates bitrate according to auditory saliency. MUFFIN decouples speaker identity from linguistic content, enabling seamless integration with large language models. Its architecture combines a Transformer-inspired convolutional backbone, an enhanced Snake activation function, and fine-grained multi-band spectral reconstruction. Experiments demonstrate that MUFFIN surpasses state-of-the-art methods across multiple benchmarks; its high-compression variant achieves a remarkably low token rate of 12.5 Hz while maintaining minimal distortion. Moreover, it significantly improves performance on downstream generative tasks, validating its effectiveness and generalizability as a high-fidelity, semantically rich audio token representation.

📝 Abstract
Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) module that allocates bitrate across frequency bands based on perceptual salience. This design enables efficient compression while disentangling speaker identity from content using distinct codebooks. MUFFIN incorporates a transformer-inspired convolutional backbone and a modified snake activation to enhance resolution in fine-grained spectral regions. Experimental results on multiple benchmarks demonstrate that MUFFIN consistently outperforms existing approaches in reconstruction quality. A high-compression variant achieves a state-of-the-art 12.5 Hz rate with minimal loss. MUFFIN also proves effective in downstream generative tasks, highlighting its promise as a token representation for integration with language models. Audio samples and code are available.
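The "modified snake activation" mentioned in the abstract builds on the standard Snake function, x + sin²(αx)/α, which adds a learnable periodic component on top of an identity mapping. As a minimal sketch (the standard form only; the paper's exact modification is not specified here):

```python
import numpy as np

def snake(x, alpha=1.0):
    # Standard Snake activation (Ziyin et al.):
    # identity plus a bounded periodic term, x + sin^2(alpha * x) / alpha.
    # alpha controls the frequency of the periodic component and is
    # typically learned per channel in neural audio models.
    return x + np.sin(alpha * x) ** 2 / alpha

# The periodic term vanishes at multiples of pi/alpha, so snake behaves
# like the identity there while bending the function in between.
```

The periodic inductive bias is what makes Snake-style activations popular in audio synthesis backbones, since waveforms are themselves quasi-periodic.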
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity audio compression while preserving perceptual quality
Allocating bitrate across frequency bands based on perceptual salience
Disentangling speaker identity from content using distinct codebooks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Band Spectral Residual Vector Quantization module
Psychoacoustically guided multi-band frequency reconstruction
Transformer-inspired convolutional backbone with snake activation
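Residual vector quantization, the building block behind MBS-RVQ, encodes a vector in successive stages, each stage quantizing the residual left by the previous one. A toy NumPy sketch (shapes, codebook sizes, and function names are illustrative assumptions, not the paper's implementation; MBS-RVQ additionally runs such quantizers per frequency band with perceptually weighted bit allocation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: each stage picks the nearest codeword to the
    current residual, then subtracts it before the next stage."""
    residual = x.copy()
    codes, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        # distance from the residual to every codeword in this stage
        d = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(d))
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return codes, quantized

# toy setup: 3 stages, 16 codewords each, 8-dim latent vectors
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
x = rng.normal(size=8)
codes, xq = rvq_encode(x, codebooks)
```

The reconstruction is simply the sum of the selected codewords, so deeper stages refine the approximation at the cost of more bits per frame.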
👥 Authors
Dianwen Ng
MiroMind, Alibaba-NTU Singapore Joint Research Institute
Artificial Intelligence · Deep Learning · Speech Recognition · Self-supervised Learning
Kun Zhou
Tongyi Speech Lab, Alibaba Group, Singapore
Yi-Wen Chao
College of Computing & Data Science, Nanyang Technological University, Singapore
Zhiwei Xiong
University of Science and Technology of China
computational photography · biomedical image analysis
Bin Ma
Tongyi Speech Lab, Alibaba Group, Singapore
Eng Siong Chng
College of Computing & Data Science, Nanyang Technological University, Singapore