🤖 AI Summary
Speech enhancement often faces a trade-off between computational efficiency and perceptual quality. To address this, we propose MAGE, a lightweight speech enhancement system based on masked generative modeling. MAGE introduces a scarcity-aware, coarse-to-fine masking strategy that recovers frequent tokens in early steps and rare tokens in later refinements, improving both efficiency and generalization, together with a lightweight corrector module that detects low-confidence predictions and re-masks them for refinement. Built on the BigCodec architecture and fine-tuned from Qwen2.5-0.5B, the model is reduced to 200M parameters through selective layer retention. On the DNS Challenge and noisy LibriSpeech benchmarks, MAGE achieves state-of-the-art perceptual quality, significantly reduces downstream ASR word error rates, and outperforms larger baselines in both fidelity and efficiency.
📝 Abstract
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models that use random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and fine-tuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on the DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
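The two ideas in the abstract — a scarcity-aware decoding order (frequent tokens first, rare tokens in later refinements) and a corrector that re-masks low-confidence positions — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the token-frequency prior, and the confidence threshold are illustrative assumptions.

```python
def scarcity_aware_order(token_ids, freq):
    """Coarse-to-fine decoding order: positions holding frequent tokens
    come first, positions holding rare tokens are deferred to later steps.
    `freq` is an assumed token -> corpus-frequency prior."""
    return sorted(range(len(token_ids)), key=lambda i: -freq[token_ids[i]])

def remask_low_confidence(confidences, threshold=0.5):
    """Corrector pass (sketch): flag positions whose prediction confidence
    falls below a threshold, so they can be re-masked and refined."""
    return [i for i, c in enumerate(confidences) if c < threshold]

# Toy example: three decoded positions with token ids and confidences.
tokens = [3, 1, 2]                     # predicted codec tokens per position
freq = {1: 100, 2: 10, 3: 1}           # assumed frequency prior
order = scarcity_aware_order(tokens, freq)
to_remask = remask_low_confidence([0.9, 0.3, 0.7])
print(order)      # frequent-token positions first, rarest last
print(to_remask)  # low-confidence positions queued for refinement
```

In an actual masked generative decoder these two loops would interleave with model forward passes; the sketch only shows the ordering and re-masking logic.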