TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of unimodal representation and insufficient real-time performance in multi-task learning for advanced driver assistance systems (ADAS), this paper proposes an efficient multimodal collaborative understanding framework. Methodologically, it introduces a two-stage architecture: (1) a Mamba-based spatiotemporal feature extraction network for multi-view inputs, incorporating forward-backward scanning and global-local spatial attention; and (2) a task-adaptive gated multimodal fusion module to mitigate negative transfer across tasks. The framework uniquely integrates four perception tasks—driver emotion, driver behavior, traffic scene understanding, and vehicle state estimation. Evaluated on the AIDE benchmark, it achieves state-of-the-art (SOTA) performance across all four tasks, with only 6.0 million parameters and an inference speed of 142.32 FPS—demonstrating superior accuracy, ultra-low latency, and model compactness.

📝 Abstract
Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the Mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks while maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
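The task-specific multi-gating idea behind MGMI can be illustrated with a minimal sketch: each task learns its own soft gate over the modality features, so the fused representation per task emphasizes only the modalities relevant to that task. The function names, the modality set, and the plain softmax gate below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(modality_feats, gate_logits):
    """Task-specific soft gating: weight each modality's feature
    vector by a learned gate, then sum.
    modality_feats: (M, D) array, one D-dim feature per modality."""
    gates = softmax(gate_logits)   # (M,) attention over modalities
    return gates @ modality_feats  # (D,) fused feature for this task

# Toy example: 3 modalities (e.g. in-cabin view, scene view,
# vehicle state), feature dimension D = 4.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
# A task whose learned gate strongly prefers modality 0:
fused = gated_fusion(feats, np.array([4.0, 0.0, 0.0]))
```

Because each task owns its gate, a gradient from, say, driver emotion recognition barely perturbs the modalities it down-weights, which is the mechanism by which such gating mitigates negative transfer.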
Problem

Research questions and friction points this paper is trying to address.

Overcomes single-modality limits in assistive driving MTL
Improves real-time efficiency in multimodal MTL frameworks
Addresses negative transfer in multi-task feature integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based multi-view temporal-spatial feature extraction
Gated multimodal feature integration for MTL
Lightweight architecture with high-speed inference
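The forward-backward temporal scanning named above can be sketched as follows: run a causal recurrence over the frame sequence twice, once in each direction, and fuse the two passes so every time step carries both past and future context. The exponential-moving-average recurrence here is a deliberately simple stand-in for the paper's selective state-space (Mamba-style) scan; all names and the additive fusion are assumptions for illustration.

```python
import numpy as np

def scan(frames, decay=0.5):
    """Causal recurrent scan over a frame sequence: each output is an
    exponential moving average of all frames seen so far."""
    h = np.zeros_like(frames[0])
    out = []
    for f in frames:
        h = decay * h + (1 - decay) * f
        out.append(h.copy())
    return np.stack(out)

def forward_backward_scan(frames):
    """Fuse a forward scan with a time-reversed scan so every step
    aggregates both past and future context."""
    fwd = scan(frames)
    bwd = scan(frames[::-1])[::-1]
    return fwd + bwd

frames = np.random.rand(8, 16)      # 8 frames, 16-dim features each
feats = forward_backward_scan(frames)
```

A linear-time scan like this is the source of the efficiency claim: unlike self-attention, its cost grows linearly rather than quadratically with sequence length.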
Wenzhuo Liu
Faculty of Marine Science and Technology, Beijing Institute of Technology, Zhuhai, China
Yicheng Qiao
PhD Student of Computer Science, Michigan State University
Sports performance analysis, Data analysis, Motion Analysis
Zhen Wang
Faculty of Marine Science and Technology, Beijing Institute of Technology, Zhuhai, China
Qiannan Guo
Beijing Normal University
Computer Vision, Action Recognition, Autonomous Driving
Zilong Chen
State Key Laboratory of Intelligent Technology and Systems and Department of Computer Science and Technology, Tsinghua University, Beijing, China
Meihua Zhou
University of Chinese Academy of Sciences
Intelligent medicine, embodied intelligence, human-computer interaction, advanced assisted driving
Xinran Li
School of Engineering and Applied Science, Yale University, New Haven, CT, USA
Letian Wang
University of Toronto, Toronto, Canada
Zhiwei Li
Beijing University of Chemical Technology, Beijing, China
Huaping Liu
Professor of Electrical Engineering, Oregon State University
Communication theory, wireless communications, signal processing, sensor networks, information security
Wenshuo Wang
Professor, Beijing Institute of Technology (BIT) | Research Fellow, UC Berkeley, CMU, McGill
Human-Robot Interaction, Autonomous Driving, Bayesian Learning, Human Factors