🤖 AI Summary
Existing online action detection (OAD) methods suffer from a training–inference discrepancy in short-term memory length: training uses truncated (varying-length) short-term memory, whereas inference relies on full-length memory, introducing a learning bias that limits performance. To address this, the authors propose the Context-enhanced Memory-Refined Transformer (CMeRT), which pairs a context-enhanced encoder, improving frame representations with additional near-past context, with a memory-refined decoder that leverages near-future generation to refine predictions. Evaluated on THUMOS'14, CrossTask, and EPIC-Kitchens-100, CMeRT achieves state-of-the-art results in both online action detection and action anticipation, narrowing the structural inconsistency between training and inference in prior OAD approaches.
📝 Abstract
Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training–inference discrepancy in existing OAD methods that hinders learning effectiveness: training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder that leverages near-future generation to enhance performance. CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
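To make the memory terminology concrete, here is a minimal illustrative sketch (not the authors' code; all window sizes and the exact placement of the near-past context are hypothetical assumptions) of how a streaming frame buffer could be partitioned into the long-term memory, near-past context, and short-term memory that such a model attends over at each time step:

```python
# Hypothetical sketch of memory partitioning for online action detection.
# Window lengths (long_len, short_len, near_past) are illustrative, not
# values from the paper.

def partition_memory(frames, long_len=16, short_len=8, near_past=4):
    """Split the frames observed so far into long-term memory,
    near-past context (extra input for a context-enhanced encoder),
    and the short-term memory over which detection is performed."""
    t = len(frames)                          # frames observed so far
    short_start = max(0, t - short_len)      # short-term window start
    ctx_start = max(0, short_start - near_past)
    long_start = max(0, ctx_start - long_len)
    return {
        "long_term": frames[long_start:ctx_start],
        "near_past": frames[ctx_start:short_start],
        "short_term": frames[short_start:t],
    }

# Stand-in for 40 observed per-frame features in a streaming video.
frames = list(range(40))
parts = partition_memory(frames)
```

At inference every window is full, whereas early in a truncated training clip `max(0, ...)` yields shorter (or empty) windows, which is exactly the varying-length short-term memory the paper identifies as the source of the training–inference gap.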