TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing streaming voice conversion methods, whose static speaker representations fail to align with dynamic linguistic content, thereby compromising naturalness, intelligibility, and privacy. To overcome this, the authors propose a low-latency end-to-end speech synthesizer that employs a content-synchronized time-varying timbre (TVT) representation. This approach leverages a global speaker memory mechanism to generate compact yet multifaceted speaker embeddings, combined with frame-level attention, gated conditioning, and spherical interpolation to enable smooth local variations while preserving the geometric structure of speaker identity. Furthermore, a factorized vector-quantized bottleneck is introduced to suppress identity leakage. The system achieves a GPU latency of under 80 milliseconds and outperforms state-of-the-art streaming voice conversion methods in terms of naturalness, speaker similarity, and anonymization performance.
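The summary describes combining a global timbre instance with frame-local facets via a gate and spherical interpolation (slerp), so that per-frame timbre varies smoothly while staying on the unit sphere where speaker identity lives. The paper's actual architecture is not shown here; the following is a minimal stdlib-only sketch of the slerp-plus-gate idea, with `global_facet`, `local_facet`, and the gate value `g` all illustrative assumptions:

```python
import math

def slerp(u, v, t):
    """Spherical interpolation between unit vectors u and v."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(u, v)]
    s = math.sin(omega)
    w_u = math.sin((1 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]

def normalize(x):
    n = math.sqrt(sum(a * a for a in x))
    return [a / n for a in x]

# hypothetical embeddings: a global timbre facet and a frame-local
# facet retrieved by content attention over the timbre memory
global_facet = normalize([1.0, 0.0, 0.0])
local_facet = normalize([0.8, 0.6, 0.0])

# gate g in [0, 1] regulates how far the frame timbre may drift
# from the global identity at this frame
g = 0.3
frame_timbre = slerp(global_facet, local_facet, g)

# slerp keeps the result on the unit sphere, so local variation
# does not distort the norm-based geometry of the identity space
print(round(sum(a * a for a in frame_timbre), 6))  # → 1.0
```

The point of slerp over linear interpolation is the final check: a linear mix of two unit embeddings falls inside the sphere (shrinking the norm), whereas slerp traverses the great-circle arc and preserves it.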

📝 Abstract
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
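The abstract's vector-quantized bottleneck regularizes content by snapping each content frame to a discrete codebook entry, which discards the fine-grained residual detail through which speaker identity could leak. The paper's factorized design is not reproduced here; below is a minimal stdlib-only sketch of plain nearest-neighbor vector quantization, with the codebook and frame values purely illustrative:

```python
def vq_quantize(x, codebook):
    """Map a content frame to its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))
    return idx, codebook[idx]

# toy 4-entry codebook over 2-D content features (hypothetical values)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# a content frame carrying residual speaker-specific detail; quantization
# replaces it with the shared code, suppressing that fine structure
frame = [0.9, 0.1]
idx, code = vq_quantize(frame, codebook)
print(idx, code)  # → 1 [1.0, 0.0]
```

Because every frame is forced onto a small shared inventory of codes, speaker-specific micro-variation cannot survive the bottleneck, which is the leakage-suppression intuition the abstract relies on.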
Problem

Research questions and friction points this paper is trying to address.

voice conversion
speaker anonymization
time-varying timbre
low-latency synthesis
content-synchronous representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

time-varying timbre
streamable voice conversion
speaker anonymization
content-synchronous representation
vector-quantized bottleneck
Waris Quamer
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA
Mu-Ruei Tseng
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA
Ghady Nasrallah
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA
Ricardo Gutierrez-Osuna
Texas A&M University, Computer Science and Engineering
Speech generation, digital health, wearable sensors, machine learning, chemometrics