🤖 AI Summary
In neural audio coding, ensuring consistent perceptual quality across diverse audio content under high compression ratios remains a critical challenge. This paper introduces MUFFIN, a fully convolutional framework featuring the novel psychoacoustically guided Multi-Band Spectral Residual Vector Quantization (MBS-RVQ), which dynamically allocates bitrate according to auditory saliency. MUFFIN decouples speaker identity from linguistic content, enabling seamless integration with large language models. Its architecture combines a transformer-inspired convolutional backbone, an enhanced Snake activation function, and fine-grained multi-band spectral reconstruction. Experiments demonstrate that MUFFIN surpasses state-of-the-art methods across multiple benchmarks; its high-compression variant achieves a remarkably low token rate of 12.5 Hz while maintaining minimal distortion. Moreover, it significantly improves performance on downstream generative tasks, validating its effectiveness and generalizability as a high-fidelity, semantically rich audio token representation.
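For context, the standard Snake activation (which MUFFIN modifies; the paper's exact variant is not specified in this summary) adds a periodic term to the identity, giving the network a built-in bias toward the oscillatory structure of audio waveforms. A minimal NumPy sketch of the baseline formulation:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Standard Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The sin^2 term injects a learnable periodic inductive bias,
    which helps waveform models resolve fine spectral detail.
    `alpha` controls the frequency of the periodic component and
    is typically learned per channel.
    """
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2
```

Note that `snake` is monotonic and passes through the origin, so it behaves like a "wavy identity": for small `alpha` it approaches a plain linear unit, while larger `alpha` emphasizes periodicity.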
📝 Abstract
Achieving high-fidelity audio compression while preserving perceptual quality across diverse content remains a key challenge in Neural Audio Coding (NAC). We introduce MUFFIN, a fully convolutional Neural Psychoacoustic Coding (NPC) framework that leverages psychoacoustically guided multi-band frequency reconstruction. At its core is a Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) module that allocates bitrate across frequency bands based on perceptual salience. This design enables efficient compression while disentangling speaker identity from content using distinct codebooks. MUFFIN incorporates a transformer-inspired convolutional backbone and a modified snake activation to enhance resolution in fine-grained spectral regions. Experimental results on multiple benchmarks demonstrate that MUFFIN consistently outperforms existing approaches in reconstruction quality. A high-compression variant achieves a state-of-the-art 12.5 Hz token rate with minimal loss. MUFFIN also proves effective in downstream generative tasks, highlighting its promise as a token representation for integration with language models. Audio samples and code are available.
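The residual vector quantization underlying MBS-RVQ works by having each stage quantize whatever error the previous stages left behind, so later codebooks refine rather than replace earlier ones. A minimal sketch of plain (single-band) RVQ with fixed codebooks; MUFFIN's psychoacoustic band-wise bitrate allocation is omitted here:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization.

    x:          (D,) vector to quantize.
    codebooks:  list of (K, D) arrays, one codebook per stage.
    Returns the per-stage code indices and the final reconstruction.
    """
    codes = []
    quantized = np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
        residual = x - quantized  # next stage quantizes what is left
    return codes, quantized
```

In MBS-RVQ, separate quantizers of this form operate per frequency band, with more codebook stages (i.e. more bits) assigned to perceptually salient bands.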