🤖 AI Summary
This work addresses the challenges of speech emotion recognition (SER) in real-world, spontaneous, and low-resource settings, where the complexity of emotional expression and the limitations of current speech technology hinder performance. It presents the first systematic integration of automatic speech recognition (ASR) into the SER pipeline, enabling deep fusion of acoustic and textual modalities through joint modeling of speech signals and ASR-generated transcripts. By leveraging the complementary information carried by the two modalities, the proposed approach improves recognition accuracy and robustness under low-resource and spontaneous-speech conditions, making SER systems more scalable and adaptable for real-life deployment, where labeled data is scarce and speech is naturally variable.
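As a rough illustration of this kind of acoustic-text fusion, the sketch below combines a pooled acoustic embedding with a pooled embedding of the ASR transcript in a small PyTorch classifier. Everything here is assumed for illustration only: the embedding sources (e.g. wav2vec 2.0 for the audio, BERT for the transcript), the dimensions, and the four-class emotion set are placeholders, not the architecture actually proposed in the thesis.

```python
import torch
import torch.nn as nn


class FusionSER(nn.Module):
    """Illustrative acoustic + text fusion classifier for SER.

    Assumes pre-extracted, pooled embeddings per utterance: an acoustic
    vector (e.g. mean-pooled wav2vec 2.0 features) and a text vector
    (e.g. pooled BERT features of the ASR transcript). All sizes are
    placeholders, not values from the thesis.
    """

    def __init__(self, acoustic_dim=768, text_dim=768,
                 hidden_dim=256, num_emotions=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Classify the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(2 * hidden_dim, num_emotions),
        )

    def forward(self, acoustic_emb, text_emb):
        a = self.acoustic_proj(acoustic_emb)
        t = self.text_proj(text_emb)
        # Concatenation is the simplest fusion of speech and
        # ASR-transcript cues; the thesis may use a richer scheme.
        return self.classifier(torch.cat([a, t], dim=-1))


if __name__ == "__main__":
    model = FusionSER()
    acoustic = torch.randn(8, 768)  # batch of pooled acoustic embeddings
    text = torch.randn(8, 768)      # batch of pooled transcript embeddings
    logits = model(acoustic, text)  # -> (8, 4) emotion logits
    print(logits.shape)
```

The point of the sketch is the data flow, not the specifics: the ASR stage turns the waveform into text, each modality is embedded separately, and a joint classifier sees both views of the same utterance.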
📝 Abstract
Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication, enabling emotionally intelligent systems, and serving as a fundamental component in the development of Artificial General Intelligence (AGI). However, deploying SER in real-world, spontaneous, and low-resource scenarios remains a significant challenge due to the complexity of emotional expression and the limitations of current speech and language technologies. This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER, with the goal of enhancing the robustness, scalability, and practical applicability of emotion recognition from spoken language.