🤖 AI Summary
This study looks beyond the overall phone error rate (PER) by analysing phone-level error patterns of raw waveform acoustic models on TIMIT. The authors decompose PER across three broad phonetic class (BPC) categorisations and build confusion matrices from substitution errors to show how the errors are distributed. Their models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with bidirectional LSTMs and reach 13.9% and 15.3% PER on the TIMIT development and test sets, respectively, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ lowers these rates to 11.3% and 12.3%, surpassing the Filterbank baseline. The per-BPC analysis shows that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, which suggests that the dominant confusions reflect inherent phonetic similarities.
📝 Abstract
We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
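The per-class decomposition and substitution confusion counts described above can be sketched as follows. This is a minimal illustration and not the authors' code. The `BPC` mapping is a hypothetical, partial phone-to-class table, the function names `align` and `analyse` are made up for this example, and the choice to charge each insertion to the class of the inserted phone is an assumption, since the abstract does not say how insertions are attributed.

```python
from collections import Counter

# Hypothetical, partial phone-to-class table (illustration only).
BPC = {
    "iy": "vowel", "ih": "vowel", "aa": "vowel",
    "s": "fricative", "z": "fricative",
    "t": "plosive", "d": "plosive",
    "n": "nasal", "m": "nasal",
}

def align(ref, hyp):
    """Levenshtein-align two phone sequences.

    Returns a list of (op, ref_phone, hyp_phone) with op in
    {"ok", "sub", "del", "ins"}.
    """
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def analyse(pairs):
    """Return (overall PER, per-class PER, substitution confusion counts)."""
    errors, totals, confusions = Counter(), Counter(), Counter()
    for ref, hyp in pairs:
        for op, r, h in align(ref, hyp):
            if r is not None:
                totals[BPC[r]] += 1          # reference phones per class
            if op == "sub":
                errors[BPC[r]] += 1
                confusions[(r, h)] += 1      # (reference, recognised) pair
            elif op == "del":
                errors[BPC[r]] += 1
            elif op == "ins":
                errors[BPC[h]] += 1          # assumption: charge to inserted phone's class
    per_class = {c: errors[c] / totals[c] for c in totals}
    overall = sum(errors.values()) / sum(totals.values())
    return overall, per_class, confusions

if __name__ == "__main__":
    overall, per_class, confusions = analyse(
        [(["s", "iy", "t", "n"], ["z", "iy", "n"])]
    )
    print(overall, per_class, dict(confusions))
```

In the toy utterance, the reference `s iy t n` is recognised as `z iy n`, giving one substitution (`s` recognised as `z`) and one deletion (`t`). Only the substitution enters the confusion counts, while both errors count towards the PER of their respective classes.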