How Does a Deep Neural Network Look at Lexical Stress?

📅 2025-08-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the interpretability of deep neural networks in identifying stress placement in English disyllabic words. We propose an analytical framework integrating convolutional neural networks (CNNs) with Layerwise Relevance Propagation (LRP), enabling the first systematic identification of acoustic cues underlying stress detection from spectrograms. Methodologically, we introduce feature-specific relevance analysis to uncover how the model leverages distributed phonetic information. Results reveal that, beyond canonical cues such as F1/F2 formant shifts in stressed syllables, the model robustly integrates pitch contours and F3 dynamics—challenging assumptions from traditional controlled-stimulus paradigms. Evaluated on a minimal-pair-free test set, our model achieves 92% accuracy, demonstrating strong generalization to natural prosodic variability. These findings advance data-driven modeling of speech prosody and provide both methodological innovation and empirical evidence for interpretable AI in speech cognition research.

📝 Abstract
Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based on highly controlled stimuli.
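The LRP technique named in the abstract propagates a network's output score backwards layer by layer, assigning each input element (here, each spectrogram pixel) a relevance value while approximately conserving the total. A minimal sketch of the epsilon rule on a toy fully connected ReLU network (the weights and network are illustrative, not the paper's CNN):

```python
import numpy as np

def lrp_epsilon(W, a_in, R_out, eps=1e-6):
    # Contribution of input i to pre-activation j: z_ij = a_i * w_ij
    z = a_in[:, None] * W
    zj = z.sum(axis=0)
    zj = zj + eps * np.where(zj >= 0, 1.0, -1.0)  # stabilized denominator
    # Each input receives a share of R_out[j] proportional to z_ij / z_j
    return (z / zj * R_out).sum(axis=1)

# Toy 2-2-1 ReLU network with fixed, illustrative weights (no biases)
x = np.array([2.0, 1.0])
W1 = np.array([[1.0, -1.0],
               [0.5,  1.0]])
W2 = np.array([[2.0],
               [1.0]])
h = np.maximum(x @ W1, 0.0)    # hidden activations: [2.5, 0.0]
y = h @ W2                     # output score: [5.0]

R_h = lrp_epsilon(W2, h, y)    # relevance of hidden units
R_x = lrp_epsilon(W1, x, R_h)  # per-input relevance "heatmap"
```

Note the conservation property: the input relevances `R_x` sum (up to the epsilon stabilizer) to the output score, which is what lets relevance maps over spectrograms be read as a decomposition of the classifier's decision.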
Problem

Research questions and friction points this paper is trying to address.

Understand how deep neural networks interpret lexical stress patterns
Analyze CNN decisions using Layerwise Relevance Propagation (LRP)
Identify spectral features influencing stress prediction in disyllabic words
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN architectures predict lexical stress position
Layerwise Relevance Propagation analyzes CNN decisions
Feature-specific relevance reveals formants' influence
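The feature-specific relevance analysis listed above can be pictured as aggregating an LRP relevance map within frequency regions associated with each phonetic feature (e.g., F1, F2, F3 bands of the stressed vowel). A hypothetical sketch, where the band edges, bin layout, and uniform relevance map are illustrative placeholders rather than the paper's actual definitions:

```python
import numpy as np

def band_relevance(R, freqs, bands):
    """R: (n_freq, n_time) LRP relevance map; freqs: (n_freq,) bin centers
    in Hz; bands: dict name -> (lo_hz, hi_hz). Returns relevance per band."""
    return {name: float(R[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in bands.items()}

freqs = np.linspace(0, 8000, 257)   # e.g., 257-bin spectrogram axis
R = np.ones((257, 100))             # dummy uniform relevance map
bands = {"F1": (200, 1000), "F2": (1000, 3000), "F3": (3000, 4000)}
scores = band_relevance(R, freqs, bands)
```

With a real relevance map, comparing these per-band totals (normalized for band width) indicates which spectral features the classifier leans on most heavily.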
Itai Allouche
Technion – Israel Institute of Technology
Deep Learning · Speech Processing
Itay Asael
Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology, Haifa, 3200003, Israel
Rotem Rousso
Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology, Haifa, 3200003, Israel
Vered Dassa
Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology, Haifa, 3200003, Israel
Ann Bradlow
Department of Linguistics, Northwestern University
speech perception · experimental phonetics · cross-language and second-language phonetics
Seung-Eun Kim
Department of Linguistics, Northwestern University, Evanston, Illinois 60208, USA
Matthew Goldrick
Department of Linguistics, Northwestern University, Evanston, Illinois 60208, USA
Joseph Keshet
Professor, Faculty of Electrical & Computer Engineering, Technion
Machine Learning · Speech and Language Processing · Spoken Language Processing