RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer

📅 2025-01-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Neural vocoders face challenges in long-sequence, sample-level audio generation, including high computational cost, an imbalance between global and local modeling, and poor real-time performance, especially under Transformer-based architectures. To address these, the authors propose RingFormer, a real-time, high-fidelity vocoder that integrates ring attention into a lightweight Conformer backbone. Its key contributions are: (1) ring attention, which enables efficient long-range dependency modeling while preserving local waveform detail; (2) a dual-discriminator (waveform + spectrogram) adversarial training scheme; and (3) seamless integration with the VITS end-to-end TTS framework. Experiments show that RingFormer matches or surpasses state-of-the-art models (e.g., HiFi-GAN, iSTFT-Net, BigVGAN) on objective metrics (MCD, F0 RMSE) and subjective MOS scores, achieves end-to-end latency below 10 ms, and supports real-time synthesis. Code and audio samples are publicly available.
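The dual-discriminator objective can be sketched as follows. This is a minimal illustration assuming LSGAN-style least-squares losses (common in HiFi-GAN-family vocoders); the paper's exact loss formulation is not given in the summary, and all function names here are chosen for illustration.

```python
import numpy as np

def lsgan_d_loss(real_score, fake_score):
    """Discriminator loss: push scores on real audio toward 1, generated toward 0."""
    return np.mean((real_score - 1.0) ** 2) + np.mean(fake_score ** 2)

def lsgan_g_loss(fake_score):
    """Generator adversarial loss: push the discriminator's fake scores toward 1."""
    return np.mean((fake_score - 1.0) ** 2)

def generator_adv_loss(fake_wave_score, fake_spec_score):
    """Total adversarial term: summed over the waveform discriminator and the
    spectrogram discriminator (a hypothetical two-discriminator setup)."""
    return lsgan_g_loss(fake_wave_score) + lsgan_g_loss(fake_spec_score)
```

In practice this adversarial term would be combined with reconstruction losses (e.g., a mel-spectrogram L1 loss), as is standard in GAN vocoder training.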

📝 Abstract
While transformers demonstrate outstanding performance across various audio tasks, their application to neural vocoders remains challenging. Neural vocoders require the generation of long audio signals at the sample level, which demands high temporal resolution. This results in significant computational costs for attention map generation and limits their ability to efficiently process both global and local information. Additionally, the sequential nature of sample generation in neural vocoders poses difficulties for real-time processing, making the direct adoption of transformers impractical. To address these challenges, we propose RingFormer, a neural vocoder that incorporates the ring attention mechanism into a lightweight transformer variant, the convolution-augmented transformer (Conformer). Ring attention effectively captures local details while integrating global information, making it well-suited for processing long sequences and enabling real-time audio generation. RingFormer is trained using adversarial training with two discriminators. The proposed model is applied to the decoder of the text-to-speech model VITS and compared with state-of-the-art vocoders such as HiFi-GAN, iSTFT-Net, and BigVGAN under identical conditions using various objective and subjective metrics. Experimental results show that RingFormer achieves comparable or superior performance to existing models, particularly excelling in real-time audio generation. Our code and audio samples are available on GitHub.
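The core of ring attention is blockwise attention with an online-softmax accumulator: key/value blocks are processed one at a time (and, in the distributed setting, rotated around a ring of devices) without ever materializing the full attention map, which is what makes long sample-level sequences tractable. A minimal single-process sketch of the blockwise computation, with all names chosen for illustration:

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: standard softmax attention with the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ v

def blockwise_attention(q, k, v, block_size):
    """Process K/V in blocks with a numerically stable online softmax.
    In ring attention each block would live on a different device and be
    passed around the ring; here all blocks are visited in a local loop."""
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full((n, 1), -np.inf)   # running row-wise max of scores
    l = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = q @ kb.T / np.sqrt(d)                      # scores vs. this block
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        p = np.exp(s - m_new)                          # block softmax numerator
        scale = np.exp(m - m_new)                      # rescale old accumulators
        l = l * scale + p.sum(axis=1, keepdims=True)
        out = out * scale + p @ vb
        m = m_new
    return out / l

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
```

The blockwise result is numerically identical to full attention, but peak memory scales with the block size rather than the full sequence length.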
Problem

Research questions and friction points this paper is trying to address.

Neural vocoders
Long audio generation
Computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

RingFormer
Ring Attention
Dual Discriminator Adversarial Training
Seongho Hong
Fintech and AI Robotics (FAIR) Laboratory, the School of Robotics, Kwangwoon University, Nowon-gu, Seoul 01897, South Korea
Yong-Hoon Choi
Kwangwoon University
Machine Learning · Communications Networks