Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the underexplored issue of fairness in phoneme-based automatic speech recognition (ASR) systems across demographic dimensions such as race, age, gender, and accent. It presents the first systematic evaluation of group-level biases in two open-source IPA transcription models, WhisperIPA and ZIPA, using multilingual and demographically annotated corpora. To account for linguistically acceptable phonemic variation, the authors introduce Soft PER, a variant of phoneme error rate (PER) that relaxes strict phoneme matching. Experimental results show that even when such permissible variation is accommodated, both models exhibit significant performance disparities across languages, genders, accents, ethnicities, and age groups. These findings underscore persistent fairness challenges in current IPA-based ASR systems and highlight the need for more equitable model development and evaluation practices.

📝 Abstract

The spread of automatic speech recognition (ASR) systems has prompted growing study of demographic biases related to race, age, gender, and accent, which often stem from imbalanced training data. Most of these studies have focused on standard grapheme-based ASR systems, with comparatively little attention to phoneme-based systems such as models that produce International Phonetic Alphabet (IPA) transcriptions. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate two state-of-the-art open-source ASR systems that generate IPA transcriptions, WhisperIPA and ZIPA, across diverse accents and languages. Our evaluation draws on existing multilingual speech corpora and demographically annotated English-language corpora. We measure performance by comparing model-generated IPA transcriptions against reference transcriptions produced by grapheme-to-phoneme (G2P) systems, using both the standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and across demographic groups defined by gender, accent, ethnicity, and age, and it reveals persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
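
To make the difference between the two metrics concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact Soft PER definition, so this sketch assumes Soft PER is an edit-distance error rate in which substitutions between "similar" phonemes cost nothing. The `SIMILAR_PAIRS` set, the function names, and the zero cost are all hypothetical choices made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical set of phoneme pairs treated as acceptable variants.
# The paper's actual similarity criteria are not stated in the abstract.
SIMILAR_PAIRS = {frozenset(p) for p in [("ɪ", "i"), ("ɛ", "e"), ("ɑ", "a")]}


def strict_cost(ref: str, hyp: str) -> float:
    """Standard PER: any mismatch is a full substitution error."""
    return 0.0 if ref == hyp else 1.0


def soft_cost(ref: str, hyp: str) -> float:
    """Assumed Soft PER cost: substitutions within a similar pair are free."""
    if ref == hyp or frozenset((ref, hyp)) in SIMILAR_PAIRS:
        return 0.0
    return 1.0


def edit_distance(
    ref: Sequence[str], hyp: Sequence[str], sub_cost: Callable[[str, str], float]
) -> float:
    """Weighted Levenshtein distance; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, 1):
        curr = [float(i)]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1.0,               # deletion of r
                    curr[j - 1] + 1.0,           # insertion of h
                    prev[j - 1] + sub_cost(r, h),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def per(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Callable[[str, str], float] = strict_cost,
) -> float:
    """Error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference phoneme sequence must be non-empty")
    return edit_distance(ref, hyp, sub_cost) / len(ref)


reference = ["b", "ɪ", "t", "ɚ"]   # e.g. G2P output used as the reference
hypothesis = ["b", "i", "d", "ɚ"]  # e.g. model-generated IPA

print(per(reference, hypothesis))             # 0.5  (two substitutions out of four)
print(per(reference, hypothesis, soft_cost))  # 0.25 (ɪ/i tolerated, t/d still counted)
```

Under these assumptions, the soft score can never exceed the strict score for the same pair of sequences, so any group-level gap that remains under the soft cost is not explained by the tolerated substitutions alone.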

Problem

Research questions and friction points this paper is trying to address.

bias

phoneme-based ASR

IPA transcription

demographic disparity

automatic speech recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

phoneme-based ASR

IPA transcription

bias evaluation