Physics-Informed Neural Networks for Speech Production

📅 2025-11-01
🤖 AI Summary
This work addresses two obstacles to applying physics-informed neural networks (PINNs) to speech production: the nondifferentiability and vanishing gradients caused by vocal-fold collisions, and the unknown period of self-excited vocal-fold vibration. It proposes a PINN framework for the coupled glottis and vocal-tract system with three elements: (i) a differentiable approximation function models vocal-fold collisions; (ii) the oscillation period is treated as a learnable network parameter; and (iii) the coupling between glottal flow and vocal-tract acoustics is imposed as a hard constraint, so no additional loss terms are needed. Forward and inverse analyses indicate that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be estimated simultaneously from speech signals, and that the same network architecture serves both analyses.

📝 Abstract
The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The networks are trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics. Vocal-fold collisions introduce nondifferentiability and vanishing gradients, challenging phenomena for PINNs. We demonstrate, however, that introducing a differentiable approximation function enables the analysis of vocal-fold vibrations within the PINN framework. The period of self-excited vocal-fold vibration is generally unknown. We show that by treating the period as a learnable network parameter, a periodic solution can be obtained. Furthermore, by implementing the coupling between glottal flow and vocal-tract acoustics as a hard constraint, glottis-tract interaction is achieved without additional loss terms. We confirmed the method's validity through forward and inverse analyses, demonstrating that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be simultaneously estimated from speech signals. Notably, the same network architecture can be applied to both forward and inverse analyses, highlighting the versatility of this approach. The proposed method inherits the advantages of PINNs, including mesh-free computation and the natural incorporation of nonlinearities, and thus holds promise for a wide range of applications.
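The abstract does not say which differentiable approximation function is used for vocal-fold collisions, so the following is only a minimal sketch of the general idea under stated assumptions. It replaces a hard contact force k * max(0, -x) with a softplus-smoothed version. The displacement x, stiffness k, sharpness beta, and all function names are hypothetical and are not taken from the paper.

```python
import math

def softplus(z: float, beta: float) -> float:
    """Smooth stand-in for max(0, z); approaches it as beta grows."""
    bz = beta * z
    # Numerically stable form of log(1 + exp(bz)) / beta.
    return (max(bz, 0.0) + math.log1p(math.exp(-abs(bz)))) / beta

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def contact_force_hard(x: float, k: float) -> float:
    """Assumed hard collision model: force only when x < 0 (folds in contact)."""
    return k * max(0.0, -x)

def contact_force_smooth(x: float, k: float, beta: float) -> float:
    """Smoothed collision force (assumption: softplus smoothing)."""
    return k * softplus(-x, beta)

def contact_force_smooth_grad(x: float, k: float, beta: float) -> float:
    """d/dx of the smoothed force; nonzero and continuous for every x."""
    return -k * sigmoid(-beta * x)

# Compare hard and smoothed forces on both sides of the contact point x = 0.
for x in (-1.0, 0.0, 0.5):
    print(x, contact_force_hard(x, 2.0), contact_force_smooth(x, 2.0, 5.0),
          contact_force_smooth_grad(x, 2.0, 5.0))
```

The hard model has a kink at x = 0 and zero gradient for x > 0, which is the kind of nondifferentiability and vanishing gradient the abstract mentions. The smoothed version keeps a small nonzero gradient on the open side, at the cost of a small nonzero force there, and beta trades off the two.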
Problem

Research questions and friction points this paper is trying to address.

Analyzing vocal-fold vibrations with physics-informed neural networks despite collision-induced nondifferentiability
Obtaining periodic solutions when the period of self-excited vocal-fold vibration is unknown, by treating the period as a learnable network parameter
Simultaneously estimating glottal flow, vocal-fold state, and subglottal pressure from speech signals
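One way to picture a learnable period is sketched below; it is an illustration under assumptions, not the paper's implementation. Time is passed through sine and cosine features of 2*pi*t/T, so any model built on them is exactly T-periodic, and T is then updated by gradient descent like any other parameter. In this toy, a known target signal stands in for the physics residual that a PINN would minimize, and the names periodic_features and fit_period are hypothetical.

```python
import math

def periodic_features(t, period, n_harmonics=2):
    """Sine/cosine features of t that repeat exactly every `period`."""
    feats = []
    for k in range(1, n_harmonics + 1):
        phase = 2.0 * math.pi * k * t / period
        feats.extend([math.sin(phase), math.cos(phase)])
    return feats

def fit_period(ts, targets, period_init, lr=0.02, steps=500):
    """Gradient descent on the period T of the toy model sin(2*pi*t/T)."""
    period = period_init
    for _ in range(steps):
        grad = 0.0
        for t, y in zip(ts, targets):
            phase = 2.0 * math.pi * t / period
            residual = math.sin(phase) - y
            # Chain rule: d/dT sin(2*pi*t/T) = cos(2*pi*t/T) * (-2*pi*t / T**2)
            grad += 2.0 * residual * math.cos(phase) * (-2.0 * math.pi * t / period**2)
        period -= lr * grad / len(ts)  # mean-squared-error gradient step
    return period

# Toy data: one cycle of a signal whose true period is 1.0.
ts = [i / 50 for i in range(50)]
targets = [math.sin(2.0 * math.pi * t) for t in ts]

# Start from a wrong guess and let the period adapt.
learned = fit_period(ts, targets, period_init=1.2)
print(learned)
```

The initial guess matters in a sketch like this: a loss over sinusoids has local minima in T, so a starting value far from the true period may not converge to it.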
Innovation

Methods, ideas, or system contributions that make the work stand out.

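PINNs trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics
A differentiable approximation function handles the nondifferentiability and vanishing gradients caused by vocal-fold collisions
Hard-constraint coupling achieves glottis-tract interaction without additional loss terms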
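The page does not describe how the hard constraint is constructed, so the sketch below shows only a common PINN pattern that fits the description: the network output is transformed so that a boundary condition holds by construction rather than through a penalty term. Here the volume flow at the vocal-tract inlet (x = 0) is forced to equal the glottal flow. Both glottal_flow and raw_network are hypothetical stand-ins, not the paper's models.

```python
import math

def glottal_flow(t: float) -> float:
    """Hypothetical stand-in for the glottal flow predicted by a glottis model."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * t))

def raw_network(x: float, t: float) -> float:
    """Hypothetical stand-in for an unconstrained network output."""
    return math.tanh(1.3 * x - 0.7 * t + 0.2)

def tract_flow(x: float, t: float) -> float:
    """Vocal-tract flow with the inlet condition built in.

    The factor x vanishes at the inlet, so tract_flow(0, t) == glottal_flow(t)
    for any network weights, and no boundary loss term is required.
    """
    return glottal_flow(t) + x * raw_network(x, t)

# The inlet value matches the glottal flow exactly; interior values are free.
for t in (0.0, 0.1, 0.25):
    print(t, glottal_flow(t), tract_flow(0.0, t), tract_flow(0.5, t))
```

With a soft constraint, the mismatch at x = 0 would instead be added to the loss and only driven toward zero during training, which also requires choosing a weight for that term.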
Kazuya Yokota
Nagaoka University of Technology
Acoustics, Physics-informed neural networks, Machine Learning
Ryosuke Harakawa
Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology
Masaaki Baba
Department of Mechanical Engineering, Nagaoka University of Technology, 1603-1, Kamitomioka, Nagaoka, Niigata, Japan
Masahiro Iwahashi
Department of Electrical, Electronics and Information Engineering, Nagaoka University of Technology