🤖 AI Summary
Speech enhancement often faces a trade-off between computational efficiency and perceptual quality. To address this, we propose MAGE, a lightweight speech enhancement system based on masked generative modeling. MAGE introduces a scarcity-aware, coarse-to-fine masking strategy that recovers frequent tokens in early steps and rare tokens in later refinements, improving both efficiency and generalization, together with a lightweight corrector module that detects low-confidence predictions and re-masks them for refinement. Built on the BigCodec architecture and fine-tuned from Qwen2.5-0.5B, the model is reduced to 200M parameters through selective layer retention. On the DNS Challenge and noisy LibriSpeech benchmarks, MAGE achieves state-of-the-art perceptual quality, significantly reduces downstream ASR word error rates, and outperforms larger baselines in both fidelity and efficiency.
📝 Abstract
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models that use random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and fine-tuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on the DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
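The two ideas in the abstract — a scarcity-aware decoding order (frequent tokens first, rare tokens in later refinements) and a corrector that re-masks low-confidence positions — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the token-frequency prior, and the confidence threshold are illustrative assumptions.

```python
def scarcity_aware_order(token_ids, freq):
    """Coarse-to-fine decoding order: positions holding frequent tokens
    come first, positions holding rare tokens are deferred to later steps.
    `freq` is an assumed token -> corpus-frequency prior."""
    return sorted(range(len(token_ids)), key=lambda i: -freq[token_ids[i]])

def remask_low_confidence(confidences, threshold=0.5):
    """Corrector pass (sketch): flag positions whose prediction confidence
    falls below a threshold, so they can be re-masked and refined."""
    return [i for i, c in enumerate(confidences) if c < threshold]

# Toy example: three decoded positions with token ids and confidences.
tokens = [3, 1, 2]                     # predicted codec tokens per position
freq = {1: 100, 2: 10, 3: 1}           # assumed frequency prior
order = scarcity_aware_order(tokens, freq)
to_remask = remask_low_confidence([0.9, 0.3, 0.7])
print(order)      # frequent-token positions first, rarest last
print(to_remask)  # low-confidence positions queued for refinement
```

In an actual masked generative decoder these two loops would interleave with model forward passes; the sketch only shows the ordering and re-masking logic.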