Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained temporal action localization (TAL) in multi-view, multimodal videos—encompassing 360° panoramic, third-person, and egocentric perspectives. To tackle this challenge, we propose three key contributions: (1) the first adaptation of the Temporal Shift Module (TSM) to TAL, incorporating explicit background-class modeling and fixed-length, non-overlapping temporal segment classification; (2) a joint multi-task framework that simultaneously optimizes scene classification and action localization, enabling tight action–scene contextual co-modeling; and (3) a weighted model ensemble strategy to enhance prediction robustness across views and modalities. Our approach achieves first place in both the preliminary and extended rounds of the ICCV 2025 BinEgo-360 Challenge, demonstrating significant improvements in localization accuracy and cross-view consistency. The method advances state-of-the-art performance for multi-view, multimodal TAL by unifying temporal modeling, contextual reasoning, and ensemble-based uncertainty mitigation.

📝 Abstract
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
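The abstract's core extension of TSM — classifying fixed-length, non-overlapping intervals into action classes plus an explicit background class, then reading off temporal segments — can be sketched as a small post-processing step. This is a minimal illustration of the general scheme, not the authors' implementation; the background index and interval length are assumptions.

```python
def intervals_to_segments(labels, interval_len, background=0):
    """Merge per-interval class predictions into (start, end, class) segments.

    `labels` holds one predicted class per fixed-length, non-overlapping
    interval; runs of the same non-background label become one segment.
    The background index (0 here) is an assumption for illustration.
    """
    segments = []
    run_start, run_label = None, None
    for i, lab in enumerate(labels):
        if lab != run_label:
            # Close the previous run if it was a real action class.
            if run_label is not None and run_label != background:
                segments.append((run_start * interval_len, i * interval_len, run_label))
            run_start, run_label = i, lab
    # Flush a trailing action run at the end of the video.
    if run_label is not None and run_label != background:
        segments.append((run_start * interval_len, len(labels) * interval_len, run_label))
    return segments
```

For example, predictions `[0, 2, 2, 0, 1, 1, 1]` with 1-second intervals yield the segments `(1.0, 3.0, class 2)` and `(4.0, 7.0, class 1)`.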
Problem

Research questions and friction points this paper is trying to address.

How to adapt the Temporal Shift Module, originally designed for clip-level action recognition, to temporal action localization
How to localize fine-grained actions consistently across multi-perspective (panoramic, third-person, egocentric) and multi-modal recordings
How to obtain robust predictions from individual models trained on different views and modalities
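The multi-task framework jointly optimizes scene classification and per-interval action classification over a shared backbone. A hedged sketch of such a joint objective is below; the balancing factor `alpha` and the simple weighted-sum form are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multitask_loss(scene_logits, scene_label, action_logits, action_labels, alpha=0.5):
    """Weighted sum of the scene loss and the mean per-interval action loss.

    `alpha` balances the two tasks; 0.5 here is an assumed default,
    not a value reported in the paper.
    """
    scene_loss = cross_entropy(scene_logits, scene_label)
    action_loss = np.mean(
        [cross_entropy(l, y) for l, y in zip(action_logits, action_labels)]
    )
    return alpha * scene_loss + (1 - alpha) * action_loss
```

Training both heads against this single objective is what lets the model exploit contextual cues between environments and the actions that occur in them.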
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended Temporal Shift Module for action localization
Multi-task learning with scene classification
Weighted ensemble strategy for robust predictions
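The weighted ensemble combines per-interval class probabilities from several models (e.g. trained on different views or modalities) into one prediction. A minimal sketch, assuming each model outputs a probability distribution per interval; the weight values themselves would be chosen per model and are not specified here.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Weighted average of per-interval class probabilities across models.

    prob_list: list of arrays, each (num_intervals, num_classes).
    weights:   one scalar per model; normalized so the result is still
               a probability distribution per interval.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(prob_list)              # (num_models, num_intervals, num_classes)
    return np.tensordot(w, stacked, axes=1)    # (num_intervals, num_classes)
```

The final per-interval label is then the argmax over classes of the averaged distribution, which smooths out disagreements between individual models.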