🤖 AI Summary
This study investigates the representational capacity of the CNN front-end in Wav2Vec 2.0 for monophthong acoustic properties, focusing on front–back vowel discrimination. Using the TIMIT corpus, we extract activations from each CNN layer and compare them against hand-crafted features (MFCCs, and MFCCs augmented with formant frequencies), using SVM classification accuracy as an interpretable, quantitative evaluation metric. Layer-wise analysis reveals that low-level CNN activations match or exceed the classification performance of the traditional features, indicating that they implicitly encode critical acoustic cues such as F1/F2 distributions. This layer-resolved framework offers an interpretable, quantifiable lens on the internal representations of self-supervised speech models, and the findings underscore the phonemic representational potential of Wav2Vec 2.0's early-layer features for segmental speech analysis.
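The layer-wise extraction described above can be illustrated with a minimal sketch. Wav2Vec 2.0's CNN front-end is a stack of seven temporal convolutions (kernel sizes 10, 3, 3, 3, 3, 2, 2; strides 5, 2, 2, 2, 2, 2, 2); the toy implementation below uses randomly initialised weights, a reduced channel count, and ReLU in place of the model's GELU, purely to show how one collects each layer's activations for later probing. It is not the trained model.

```python
import numpy as np

# Wav2Vec 2.0 feature-encoder geometry (kernels/strides as published);
# weights here are random stand-ins, and channels are reduced from 512.
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
CHANNELS = 16  # the real model uses 512

rng = np.random.default_rng(0)

def conv1d(x, w, stride):
    """Valid strided 1-D convolution: x is (C_in, T), w is (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    t_out = (x.shape[1] - k) // stride + 1
    out = np.empty((c_out, t_out))
    for t in range(t_out):
        patch = x[:, t * stride : t * stride + k]   # (C_in, K) window
        out[:, t] = np.einsum('oik,ik->o', w, patch)
    return np.maximum(out, 0.0)                     # ReLU stand-in for GELU

def layerwise_activations(waveform):
    """Run the conv stack and return the activation tensor after each layer."""
    x = waveform[None, :]                           # (1, T) mono input
    acts, c_in = [], 1
    for k, s in zip(KERNELS, STRIDES):
        w = rng.normal(scale=0.1, size=(CHANNELS, c_in, k))
        x = conv1d(x, w, s)
        acts.append(x)
        c_in = CHANNELS
    return acts

wave = rng.normal(size=16000)                       # 1 s of fake 16 kHz audio
acts = layerwise_activations(wave)
for i, a in enumerate(acts, 1):
    print(f"layer {i}: {a.shape}")
```

With the published strides, one second of 16 kHz audio yields roughly 49 output frames (a ~50 Hz frame rate, or one vector per ~20 ms), which is the granularity at which per-layer probes operate.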
📝 Abstract
Automatic speech recognition has advanced with self-supervised learning, which enables feature extraction directly from raw audio. In Wav2Vec 2.0, a CNN front-end first transforms the raw waveform into feature vectors before the transformer processes them. This study examines the information the CNN extracts for monophthong vowels in the TIMIT corpus. We compare MFCCs, MFCCs augmented with formant frequencies, and CNN layer activations by training SVM classifiers for front–back vowel identification, using their classification accuracy to assess the phonetic content of each representation.
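The evaluation protocol can be sketched as follows. This is a hypothetical stand-in, assuming scikit-learn: the synthetic feature matrix plays the role of MFCCs (or per-layer CNN activations), and class separation is injected in two dimensions to mimic an F1/F2 cue; the actual study uses real TIMIT vowel tokens.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 200 synthetic vowel tokens with binary front/back labels.
n = 200
labels = rng.integers(0, 2, n)

# 13-dim features (MFCC-sized); the last two dims carry a class-dependent
# shift, loosely mimicking how F1/F2 separate front from back vowels.
X = rng.normal(size=(n, 13))
X[:, -2:] += labels[:, None] * 2.0

def frontback_accuracy(features, y):
    """Mean 5-fold cross-validated accuracy of an RBF-SVM classifier."""
    clf = SVC(kernel="rbf", C=1.0)
    return cross_val_score(clf, features, y, cv=5).mean()

acc = frontback_accuracy(X, labels)
print(f"SVM front/back accuracy: {acc:.2f}")
```

In the study, the same scoring function would be applied once per feature set (MFCCs, MFCCs + formants, and each CNN layer's activations), so the accuracies are directly comparable across representations.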