🤖 AI Summary
This paper proposes TBT-Former to address two key challenges in temporal action localization (TAL) for untrimmed videos: imprecise localization caused by ambiguous action boundaries, and insufficient fusion of multi-scale context. Methodologically, it introduces (1) a boundary probability distribution regression framework that treats boundary prediction as an uncertainty estimation problem; (2) a higher-capacity bidirectional Transformer backbone for stronger temporal modeling; and (3) a cross-scale feature pyramid with top-down fusion pathways that lets context at different granularities interact. Technical enhancements include an expanded MLP dimension, additional attention heads, and a Generalized Focal Loss that optimizes the distribution learning. Evaluated on THUMOS14 and EPIC-Kitchens 100, TBT-Former achieves state-of-the-art performance, and it remains highly competitive on ActivityNet-1.3. Notably, it significantly improves localization accuracy for actions with ambiguous temporal boundaries.
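The boundary-distribution idea in (1) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; it only shows the GFL-style decoding step the summary describes, where the head outputs logits over a set of discrete candidate offsets and the regressed boundary is the expectation of the resulting distribution. The function names, the bin count, and the offset range are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expected_offset(logits, max_range):
    """Decode a boundary offset from discrete distribution logits.

    logits: (n_bins,) scores over n_bins evenly spaced candidate
    offsets in [0, max_range]. The regressed offset is the expectation
    of the categorical distribution, so the head can express either
    confidence (a sharp peak) or boundary uncertainty (a flat or
    multi-modal distribution).
    """
    n_bins = logits.shape[-1]
    bins = np.linspace(0.0, max_range, n_bins)  # candidate offsets
    probs = softmax(logits, axis=-1)            # per-bin probabilities
    return float((probs * bins).sum(axis=-1))   # expectation = decoded offset

# A sharp peak at bin 3 (of 8 bins spanning [0, 7]) decodes to ~3.0,
# while all-zero logits (maximal uncertainty) decode to the midpoint 3.5.
sharp = np.full(8, -10.0)
sharp[3] = 10.0
print(round(expected_offset(sharp, 7.0), 2))            # 3.0
print(round(expected_offset(np.zeros(8), 7.0), 2))      # 3.5
```

A distribution-aware loss such as GFL then supervises the probability mass around the ground-truth offset rather than a single scalar regression target.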
📝 Abstract
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding: identifying the start time, end time, and category of every action instance in an untrimmed video. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they still struggle with two persistent issues: precisely localizing actions with ambiguous or "fuzzy" temporal boundaries, and effectively fusing multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with more attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by Generalized Focal Loss (GFL), this head recasts the difficult task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. TBT-Former establishes a new state of the art on the highly competitive THUMOS14 and EPIC-Kitchens 100 benchmarks, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding
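The top-down pathway with lateral connections in contribution (2) follows the standard FPN recipe, which can be sketched in a few lines. This is a simplified stand-in, not the paper's code: a fixed random projection plays the role of the learned 1x1 lateral convolutions, and nearest-neighbor repetition along the time axis stands in for learned upsampling; all names and dimensions are illustrative.

```python
import numpy as np

def topdown_fuse(pyramid, out_dim=4):
    """Fuse a temporal feature pyramid coarse-to-fine.

    pyramid: list of (T_i, C_i) arrays, finest (longest) level first,
    each successive level temporally downsampled. Every level is
    laterally projected to out_dim channels, then the coarser level's
    fused features are upsampled along time (nearest-neighbor repeat)
    and added in, mixing high-level semantics into fine-grained levels.
    """
    rng = np.random.default_rng(0)
    laterals = []
    for feat in pyramid:
        # Stand-in for a learned 1x1 conv: project channels to out_dim.
        w = rng.standard_normal((feat.shape[1], out_dim)) / np.sqrt(feat.shape[1])
        laterals.append(feat @ w)
    fused = [laterals[-1]]  # the coarsest level passes through unchanged
    for lat in reversed(laterals[:-1]):
        # Upsample the previously fused (coarser) level to this level's length.
        up = np.repeat(fused[0], lat.shape[0] // fused[0].shape[0], axis=0)
        fused.insert(0, lat + up[: lat.shape[0]])
    return fused  # finest-first, all levels with out_dim channels

levels = [np.ones((8, 6)), np.ones((4, 6)), np.ones((2, 6))]
out = topdown_fuse(levels)
print([f.shape for f in out])  # [(8, 4), (4, 4), (2, 4)]
```

After fusion, every pyramid level carries both its native temporal resolution and context propagated down from coarser levels, which is what the abstract means by combining high-level semantics with low-level temporal detail.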