🤖 AI Summary
Existing online action detection (OAD) methods suffer from a training–inference discrepancy in short-term memory length: training uses truncated (varying-length) short-term memory, whereas inference relies on full-length memory, introducing a learning bias that limits performance. To address this, the authors propose the Context-enhanced Memory-Refined Transformer (CMeRT), which pairs a context-enhanced encoder, improving frame representations with additional near-past context, with a memory-refined decoder that leverages near-future generation to refine predictions. Evaluated on THUMOS'14, CrossTask, and EPIC-Kitchens-100, CMeRT achieves state-of-the-art results in both online action detection and action anticipation, narrowing the structural inconsistency between training and inference in prior OAD approaches.
📝 Abstract
Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training–inference discrepancy in existing OAD methods that hinders learning effectiveness: training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder that leverages near-future generation to enhance performance. CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
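To make the memory terminology concrete, here is a minimal illustrative sketch (not the authors' code; all window sizes and the exact placement of the near-past context are hypothetical assumptions) of how a streaming frame buffer could be partitioned into the long-term memory, near-past context, and short-term memory that such a model attends over at each time step:

```python
# Hypothetical sketch of memory partitioning for online action detection.
# Window lengths (long_len, short_len, near_past) are illustrative, not
# values from the paper.

def partition_memory(frames, long_len=16, short_len=8, near_past=4):
    """Split the frames observed so far into long-term memory,
    near-past context (extra input for a context-enhanced encoder),
    and the short-term memory over which detection is performed."""
    t = len(frames)                          # frames observed so far
    short_start = max(0, t - short_len)      # short-term window start
    ctx_start = max(0, short_start - near_past)
    long_start = max(0, ctx_start - long_len)
    return {
        "long_term": frames[long_start:ctx_start],
        "near_past": frames[ctx_start:short_start],
        "short_term": frames[short_start:t],
    }

# Stand-in for 40 observed per-frame features in a streaming video.
frames = list(range(40))
parts = partition_memory(frames)
```

At inference every window is full, whereas early in a truncated training clip `max(0, ...)` yields shorter (or empty) windows, which is exactly the varying-length short-term memory the paper identifies as the source of the training–inference gap.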