TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of unimodal representation and insufficient real-time performance in multi-task learning for advanced driver assistance systems (ADAS), this paper proposes an efficient multimodal collaborative understanding framework. Methodologically, it introduces a two-stage architecture: (1) a Mamba-based spatiotemporal feature extraction network for multi-view inputs, incorporating forward-backward scanning and global-local spatial attention; and (2) a task-adaptive gated multimodal fusion module to mitigate negative transfer across tasks. The framework uniquely integrates four perception tasks—driver emotion, driver behavior, traffic scene understanding, and vehicle state estimation. Evaluated on the AIDE benchmark, it achieves state-of-the-art (SOTA) performance across all four tasks, with only 6.0 million parameters and an inference speed of 142.32 FPS—demonstrating superior accuracy, ultra-low latency, and model compactness.

📝 Abstract
Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the Mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks while maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
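The task-specific multi-gating idea behind MGMI can be illustrated with a minimal sketch: each task learns its own soft gate over the modality features, so the fused representation per task emphasizes only the modalities relevant to that task. The function names, the modality set, and the plain softmax gate below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(modality_feats, gate_logits):
    """Task-specific soft gating: weight each modality's feature
    vector by a learned gate, then sum.
    modality_feats: (M, D) array, one D-dim feature per modality."""
    gates = softmax(gate_logits)   # (M,) attention over modalities
    return gates @ modality_feats  # (D,) fused feature for this task

# Toy example: 3 modalities (e.g. in-cabin view, scene view,
# vehicle state), feature dimension D = 4.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
# A task whose learned gate strongly prefers modality 0:
fused = gated_fusion(feats, np.array([4.0, 0.0, 0.0]))
```

Because each task owns its gate, a gradient from, say, driver emotion recognition barely perturbs the modalities it down-weights, which is the mechanism by which such gating mitigates negative transfer.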
Problem

Research questions and friction points this paper is trying to address.

Overcomes single-modality limits in assistive driving MTL
Improves real-time efficiency in multimodal MTL frameworks
Addresses negative transfer in multi-task feature integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based multi-view temporal-spatial feature extraction
Gated multimodal feature integration for MTL
Lightweight architecture with high-speed inference
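The forward-backward temporal scanning named above can be sketched as follows: run a causal recurrence over the frame sequence twice, once in each direction, and fuse the two passes so every time step carries both past and future context. The exponential-moving-average recurrence here is a deliberately simple stand-in for the paper's selective state-space (Mamba-style) scan; all names and the additive fusion are assumptions for illustration.

```python
import numpy as np

def scan(frames, decay=0.5):
    """Causal recurrent scan over a frame sequence: each output is an
    exponential moving average of all frames seen so far."""
    h = np.zeros_like(frames[0])
    out = []
    for f in frames:
        h = decay * h + (1 - decay) * f
        out.append(h.copy())
    return np.stack(out)

def forward_backward_scan(frames):
    """Fuse a forward scan with a time-reversed scan so every step
    aggregates both past and future context."""
    fwd = scan(frames)
    bwd = scan(frames[::-1])[::-1]
    return fwd + bwd

frames = np.random.rand(8, 16)      # 8 frames, 16-dim features each
feats = forward_backward_scan(frames)
```

A linear-time scan like this is the source of the efficiency claim: unlike self-attention, its cost grows linearly rather than quadratically with sequence length.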
Wenzhuo Liu
Faculty of Marine Science and Technology, Beijing Institute of Technology, Zhuhai, China
Yicheng Qiao
PhD Student of Computer Science, Michigan State University
Sports performance analysis, Data analysis, Motion Analysis
Zhen Wang
Faculty of Marine Science and Technology, Beijing Institute of Technology, Zhuhai, China
Qiannan Guo
Beijing Normal University
Computer Vision, Action Recognition, Autonomous Driving
Zilong Chen
State Key Laboratory of Intelligent Technology and Systems and Department of Computer Science and Technology, Tsinghua University, Beijing, China
Meihua Zhou
University of Chinese Academy of Sciences
Intelligent medicine, embodied intelligence, human-computer interaction, advanced assisted driving
Xinran Li
School of Engineering and Applied Science, Yale University, New Haven, CT, USA
Letian Wang
University of Toronto, Toronto, Canada
Zhiwei Li
Beijing University of Chemical Technology, Beijing, China
Huaping Liu
Professor of Electrical Engineering, Oregon State University
Communication theory, wireless communications, signal processing, sensor networks, information security
Wenshuo Wang
Professor, Beijing Institute of Technology (BIT) | Research Fellow, UC Berkeley, CMU, McGill
Human-Robot Interaction, Autonomous Driving, Bayesian Learning, Human Factors