LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the geometric structure of self-supervised speech representations (HuBERT/wav2vec 2.0) and shows that high-quality voice conversion can be achieved with linear transformations alone. Methodologically, it constrains the set of allowed transformations, down to pure rotations, to probe the feature-space geometry, and uses a low-rank (rank ≈ 100) projection obtained via singular value decomposition to explicitly factorize content and speaker identity, with no nonlinear decoder required. The contributions are threefold: (i) empirical evidence that purely linear operations can match complex end-to-end nonlinear conversion models; (ii) the finding that phonetic content is confined to a low-dimensional subspace, providing an interpretable geometric basis for representation disentanglement; and (iii) competitive voice conversion quality on VCTK and LibriSpeech at a significantly reduced parameter count and inference cost, advancing lightweight, interpretable speech representation learning.

📝 Abstract
We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: https://www.kamperh.com/linearvc/.
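The abstract's first claim, that a simple linear transformation of features suffices for conversion, can be illustrated with a least-squares fit. This is a minimal sketch, not the authors' exact pipeline: the feature matrices here are synthetic stand-ins (real features would come from a model like HuBERT, with dimension 768), and how source and target frames are aligned is left unspecified.

```python
import numpy as np

# Sketch: fit one linear map W that converts source-speaker features X
# into time-aligned target-speaker features Y, then apply it per frame.
# Synthetic data with d = 64 purely to illustrate the closed-form fit.
rng = np.random.default_rng(0)
n_frames, d = 500, 64
X = rng.standard_normal((n_frames, d))  # source-speaker features
W_true = rng.standard_normal((d, d))    # hypothetical ground-truth map
Y = X @ W_true                          # aligned target-speaker features

# Closed-form least-squares solution: W = argmin_W ||XW - Y||_F^2
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

converted = X @ W  # converted features; a vocoder would synthesize audio
print(np.allclose(converted, Y))  # True on this synthetic example
```

The appeal of this formulation is that conversion reduces to a single matrix multiply per frame, with no learned decoder in the loop.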
Problem

Research questions and friction points this paper is trying to address.

Can a simple linear transformation of self-supervised features convert one voice to another?
What does constraining the transformation (e.g., to pure rotation) reveal about the geometry of the feature space?
Can content and speaker information be explicitly factorized with a low-rank (rank-100) linear projection?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single learned linear transformation of self-supervised features suffices for voice conversion
Pure rotation of the features still yields high-quality conversion, probing the feature-space geometry
Singular value decomposition explicitly factorizes content and speaker information into a rank-100 projection
👥 Authors
H. Kamper — Electrical and Electronic Engineering, Stellenbosch University, South Africa
B. van Niekerk — Electrical and Electronic Engineering, Stellenbosch University, South Africa
Julian Zaidi — Ubisoft La Forge
M. Carbonneau — Ubisoft La Forge, Montreal, Canada

🏷️ Topics: machine learning · representation learning · speech synthesis · text-to-speech