UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets

πŸ“… 2025-07-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current spoken language understanding (SLU) approaches predominantly rely on task-specific models, leading to system redundancy, weak cross-task synergy, and underutilization of heterogeneous data. This work proposes a unified generative SLU framework that reformulates automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA) as sequence-to-sequence generation tasks under shared latent representations. The framework enables end-to-end training via unified task prompting, joint decoding, and cross-task data fusion. Its core contributions are: (1) a task-agnostic unified input-output representation; (2) a scalable generative multi-task architecture compatible with large language models; and (3) explicit modeling of semantic interdependencies among tasks. Extensive experiments on multiple public benchmarks demonstrate significant improvements over state-of-the-art methods, validating the framework’s effectiveness and generalizability in real-world multimodal speech scenarios.
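The unified task prompting and serialized output scheme described above can be sketched as follows. This is a minimal illustrative assumption of how a task-agnostic input/output representation might look; the prompt strings, bracketed entity format, and sentiment tag are hypothetical, not the paper's exact scheme.

```python
# Illustrative sketch: one seq2seq target format shared by ASR, NER, and SA,
# so heterogeneous datasets can be pooled into a single training stream.
# Prompt strings and tag formats below are assumptions for illustration.

TASK_PROMPTS = {
    "asr": "transcribe:",
    "ner": "transcribe with entities:",
    "sa": "transcribe with sentiment:",
}

def build_target(task: str, transcript: str, entities=None, sentiment=None) -> str:
    """Serialize each task's label as a single output sequence,
    so all three tasks share one generative decoder."""
    if task == "asr":
        return transcript
    if task == "ner":
        # Inline entity spans as bracketed tags, e.g. "[LOC Paris]"
        out = transcript
        for surface, etype in (entities or []):
            out = out.replace(surface, f"[{etype} {surface}]")
        return out
    if task == "sa":
        return f"{transcript} <sentiment={sentiment}>"
    raise ValueError(f"unknown task: {task}")

# A pooled NER example from a heterogeneous mix:
example = build_target(
    "ner", "John flew to Paris",
    entities=[("John", "PER"), ("Paris", "LOC")],
)
# -> "[PER John] flew to [LOC Paris]"
```

Because every task emits plain text under a shared vocabulary, the same decoder (or a large language model) can be trained jointly on all three label types without task-specific heads.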

πŸ“ Abstract
Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Unified modeling of multiple SLU tasks
Overcoming limitations of separate task architectures
Enhancing cross-task interaction and dataset utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for multiple SLU tasks
Generative method integrating ASR, NER, SA
Utilizes heterogeneous datasets across tasks
πŸ”Ž Similar Papers
No similar papers found.
Zhichao Sheng
Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China
Shilin Zhou
School of Computer Science and Technology, Soochow University
Machine Learning · Natural Language Processing
Chen Gong
Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China
Zhenghua Li
Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, Suzhou, China