SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing zero-shot speech synthesis approaches struggle to achieve semantic accuracy and acoustic fidelity at the same time: semantic tokens ensure precise text alignment but sacrifice acoustic detail, whereas acoustic encoders preserve voice quality but lack linguistic constraints. This work proposes SARA, a dual-stream variational autoencoder that, for the first time, directly fuses semantic representations from a frozen self-supervised learning (SSL) model with a trainable residual acoustic encoder. SARA thereby models semantic and acoustic information jointly, without relying on complex regularization terms. The resulting latent space is efficient and compact; the authors report reconstruction quality superior to strong baselines and highly natural, expressive output in downstream zero-shot TTS. Generation also remains robust under accelerated inference, balancing output quality against computational cost.

📝 Abstract

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis quality and computational cost.
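
The dual-stream idea in the abstract can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation: the abstract does not say how the two streams are fused, so the additive fusion below (frozen semantic anchor plus a learned acoustic residual as the posterior mean), the linear layers, the feature dimensions, and every name (`DualStreamEncoder`, `reparameterize`, `kl_to_standard_normal`) are assumptions made purely for illustration. The KL term is the standard VAE one, not a loss taken from the paper.

```python
import math
import random


def init_linear(rng, n_in, n_out, scale=0.1):
    """Random weights and zero bias for a tiny linear layer (hypothetical helper)."""
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(x, w, b):
    """Compute y = W x + b for a single frame stored as a plain list."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class DualStreamEncoder:
    """Toy encoder: frozen semantic stream plus trainable residual acoustic stream."""

    def __init__(self, ssl_dim, mel_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Frozen stream: stands in for features of a pretrained SSL model.
        # It is excluded from trainable_parameters(), so it is never updated.
        self.sem_w, self.sem_b = init_linear(rng, ssl_dim, latent_dim)
        # Trainable stream: predicts a residual mean and a log-variance
        # from acoustic features (e.g. a mel-spectrogram frame).
        self.ac_w, self.ac_b = init_linear(rng, mel_dim, 2 * latent_dim)

    def trainable_parameters(self):
        return [self.ac_w, self.ac_b]

    def encode(self, ssl_frame, mel_frame):
        anchor = linear(ssl_frame, self.sem_w, self.sem_b)
        out = linear(mel_frame, self.ac_w, self.ac_b)
        residual, logvar = out[:self.latent_dim], out[self.latent_dim:]
        # Assumed fusion: posterior mean = semantic anchor + acoustic residual.
        mu = [a + r for a, r in zip(anchor, residual)]
        return mu, logvar


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one frame."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))
```

In this sketch only the acoustic stream would receive gradient updates, so the latent stays tied to the semantic anchor while the residual is free to carry whatever acoustic detail the anchor lacks. A real system would replace the linear layers with an actual SSL model and a convolutional or transformer encoder, and would add a decoder and a reconstruction loss.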

Problem

Research questions and friction points this paper is trying to address.

zero-shot TTS

speech representation

acoustic tokens

semantic tokens

speech tokenization

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-stream VAE

semantic-acoustic fusion

zero-shot TTS