LinearVC: Linear transformations of self-supervised features through the lens of voice conversion

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the geometric structure of self-supervised speech representations (HuBERT/wav2vec 2.0) and shows that high-quality voice conversion can be achieved with linear transformations alone. Methodologically, it constrains the set of allowed transformations, down to pure rotations, to probe the feature-space geometry, and uses a low-rank (rank ≈ 100) projection obtained via singular value decomposition to explicitly factorize content and speaker identity, with no nonlinear decoder required. The contributions are threefold: (i) empirical evidence that purely linear operations can match complex end-to-end nonlinear conversion models; (ii) the finding that phonetic content is confined to a low-dimensional subspace, providing an interpretable geometric basis for representation disentanglement; and (iii) competitive voice conversion quality on VCTK and LibriSpeech at a significantly reduced parameter count and inference cost, advancing lightweight, interpretable speech representation learning.

📝 Abstract
We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: https://www.kamperh.com/linearvc/.
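The abstract's first claim, that a simple linear transformation of features suffices for conversion, can be illustrated with a least-squares fit. This is a minimal sketch, not the authors' exact pipeline: the feature matrices here are synthetic stand-ins (real features would come from a model like HuBERT, with dimension 768), and how source and target frames are aligned is left unspecified.

```python
import numpy as np

# Sketch: fit one linear map W that converts source-speaker features X
# into time-aligned target-speaker features Y, then apply it per frame.
# Synthetic data with d = 64 purely to illustrate the closed-form fit.
rng = np.random.default_rng(0)
n_frames, d = 500, 64
X = rng.standard_normal((n_frames, d))  # source-speaker features
W_true = rng.standard_normal((d, d))    # hypothetical ground-truth map
Y = X @ W_true                          # aligned target-speaker features

# Closed-form least-squares solution: W = argmin_W ||XW - Y||_F^2
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

converted = X @ W  # converted features; a vocoder would synthesize audio
print(np.allclose(converted, Y))  # True on this synthetic example
```

The appeal of this formulation is that conversion reduces to a single matrix multiply per frame, with no learned decoder in the loop.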
Problem

Research questions and friction points this paper is trying to address.

Can a simple linear transformation of self-supervised features convert one voice to another?
What does constraining the transformation (e.g., to pure rotation) reveal about the geometry of the feature space?
Can content and speaker information be explicitly factorized with a low-rank (rank-100) linear projection?
Innovation

Methods, ideas, or system contributions that make the work stand out.

A single learned linear transformation of self-supervised features suffices for voice conversion
Pure rotation of the features still yields high-quality conversion, probing the feature-space geometry
Singular value decomposition explicitly factorizes content and speaker information into a rank-100 projection
👥 Authors
H. Kamper — Electrical and Electronic Engineering, Stellenbosch University, South Africa
B. van Niekerk — Electrical and Electronic Engineering, Stellenbosch University, South Africa
Julian Zaidi — Ubisoft La Forge
M. Carbonneau — Ubisoft La Forge, Montreal, Canada

🏷️ Topics: machine learning · representation learning · speech synthesis · text-to-speech