Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis

📅 2025-03-03

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Automatic prosodic accent recognition—encompassing phrasal, lexical, and contrastive accents—remains challenging; moreover, current ASR systems exhibit limited inclusivity for neurodiverse individuals and speakers across gender identities. Method: We propose the first fine-grained multi-task fine-tuning framework based on Whisper-large-v2, uniquely adapting it to jointly recognize all three accent types while simultaneously classifying neurotype and gender. We construct and manually annotate a neurodiverse speech dataset and introduce bias-mitigating multi-task learning to disentangle prosodic, neurocognitive, and sociodemographic attributes. Contribution/Results: Our model achieves near-human accent recognition accuracy (≥92%), with neurotype and gender classification accuracies exceeding 99%. It significantly enhances ASR robustness and fairness across diverse articulatory patterns and speaker demographics, advancing interpretable and inclusive speech technologies.

Technology Category

Application Category

📝 Abstract

Prosody plays a crucial role in speech perception, influencing both human understanding and automatic speech recognition (ASR) systems. Despite its importance, prosodic stress remains under-studied due to the challenge of efficiently analyzing it. This study explores fine-tuning OpenAI's Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. Using a dataset of 66 native English speakers, including male, female, neurotypical, and neurodivergent individuals, we assess the model's ability to generalize stress patterns and classify speakers by neurotype and gender based on brief speech samples. Our results highlight near-human accuracy in ASR performance across all three stress types and near-perfect precision in classifying gender and neurotype. By improving prosody-aware ASR, this work contributes to equitable and robust transcription technologies for diverse populations.

Problem

Research questions and friction points this paper is trying to address.

Fine-tuning Whisper for prosodic stress analysis.

Improving ASR for diverse speaker populations.

Enhancing gender and neurotype classification accuracy.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Whisper model for prosodic stress analysis

Dataset includes diverse neurotypes and genders

Achieves near-human accuracy in stress recognition

🔎 Similar Papers

M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper