🤖 AI Summary
This study addresses the automatic classification of four vocal modes in singing (breathy, neutral, flow, and pressed phonation) by leveraging self-supervised speech foundation models such as HuBERT and wav2vec 2.0. The proposed approach extracts hierarchical embeddings from early layers of these models, applies global temporal pooling, and employs lightweight classifiers (SVM or XGBoost) to recognize vocal mode. This work presents the first empirical validation of the transferability of general-purpose speech foundation models to singing voice tasks, demonstrating that early-layer embeddings outperform conventional handcrafted acoustic features. On a soprano dataset, the method achieves 95.7% accuracy, surpassing spectral baselines by 12–15 percentage points and overcoming longstanding performance limitations of traditional approaches.
📝 Abstract
We present voice2mode, a method for classifying four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural networks; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of roughly 12–15 percentage points over the best traditional baseline. We also characterize layer-wise behavior: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).
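The pipeline described above (frame-level embeddings from an early model layer, global temporal pooling, then a lightweight SVM) can be sketched as follows. This is a minimal illustration, not the authors' released code: the synthetic random features with class-dependent means stand in for real early-layer HuBERT frame embeddings, which in practice would come from something like the `hidden_states` output of a pretrained HuBERT model (e.g. via the `transformers` library with `output_hidden_states=True`); the dimensions and classifier settings here are illustrative assumptions.

```python
# Sketch of the voice2mode pipeline: frame-level embeddings ->
# global temporal (mean) pooling -> lightweight SVM classifier.
# Synthetic features stand in for early-layer HuBERT embeddings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_PER_CLASS, T, D = 40, 50, 16  # clips per mode, frames per clip, embedding dim
MODES = ["breathy", "neutral", "flow", "pressed"]


def global_temporal_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, D) sequence of frame embeddings to a single (D,) vector."""
    return frames.mean(axis=0)


# Build a toy dataset: each "clip" is a (T, D) matrix of frame embeddings
# whose mean depends on the phonation mode, mimicking class-separable features.
X, y = [], []
for label, _mode in enumerate(MODES):
    for _ in range(N_PER_CLASS):
        frames = rng.normal(loc=label, scale=0.5, size=(T, D))
        X.append(global_temporal_pool(frames))
        y.append(label)
X, y = np.stack(X), np.array(y)

# Train/evaluate the lightweight classifier on the pooled embeddings.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

To apply this to real audio, each clip's `frames` array would be replaced by the hidden states of a chosen early layer, pooled the same way; the paper's finding is that which layer you pool matters more than the choice of lightweight classifier.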