WAT: Online Video Understanding Needs Watching Before Thinking

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tension between long-term temporal memory and real-time inference in existing video large language models under streaming scenarios. The authors propose WAT, a two-stage framework that first constructs hierarchical memory—comprising a short-term buffer and a fixed-capacity long-term memory—during a “Watch” phase, then performs cross-temporal reasoning by retrieving relevant information from long-term memory conditioned on the current query in a subsequent “Think” phase. Key innovations include the Watch-then-Think mechanism, a redundancy-aware eviction strategy for long-term memory, and a context-aware historical frame retrieval method. The study also introduces WAT-85K, the first dataset tailored for streaming video understanding. Evaluated on StreamingBench and OVO-Bench, the model achieves 77.7% and 55.2% accuracy, respectively, significantly outperforming existing open-source online video LLMs while meeting real-time processing requirements.
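The summary names a "redundancy-aware eviction strategy" for the fixed-capacity long-term memory but does not spell out the rule. A minimal sketch of one plausible reading, assuming frames are stored as feature vectors and the most redundant frame is the one most similar to its nearest neighbour (the class name, capacity handling, and cosine-similarity criterion are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class LongTermMemory:
    """Fixed-capacity frame store with redundancy-aware eviction (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []  # list of (timestamp, feature) pairs

    def add(self, t, feat):
        self.frames.append((t, feat))
        if len(self.frames) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # Each frame's redundancy = similarity to its nearest neighbour;
        # evict the frame with the highest redundancy, so the memory keeps
        # a diverse summary of the stream rather than near-duplicates.
        n = len(self.frames)
        best_idx, best_sim = 0, -1.0
        for i in range(n):
            sim_i = max(
                cosine_sim(self.frames[i][1], self.frames[j][1])
                for j in range(n) if j != i
            )
            if sim_i > best_sim:
                best_idx, best_sim = i, sim_i
        del self.frames[best_idx]
```

Under this policy, adding a frame nearly identical to an existing one causes one of the pair to be dropped, while visually distinct frames survive.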

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
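The abstract says the thinking stage "combines the query with the current STM context to retrieve relevant historical frames from the LTM", without giving the scoring function. A minimal sketch of one way this could work, assuming embeddings in a shared space and a simple weighted blend of query and STM mean (the `alpha` weight, the mean-pooling of STM, and the dot-product scoring are all assumptions for illustration):

```python
import numpy as np

def retrieve(query_emb, stm_feats, ltm, k=3, alpha=0.5):
    """Context-aware retrieval sketch: score LTM frames against a probe
    vector that blends the query with the short-term-memory context.

    ltm: list of (timestamp, feature) pairs; returns top-k timestamps.
    alpha: assumed hyperparameter weighting query vs. STM context.
    """
    context = np.mean(stm_feats, axis=0)
    probe = alpha * query_emb + (1 - alpha) * context
    probe = probe / (np.linalg.norm(probe) + 1e-8)
    scored = []
    for t, feat in ltm:
        f = feat / (np.linalg.norm(feat) + 1e-8)
        scored.append((float(probe @ f), t))
    scored.sort(reverse=True)  # highest similarity first
    return [t for _, t in scored[:k]]
```

Conditioning the probe on the STM context (rather than the query alone) is what makes the retrieval "context-aware": the same question asked at different points in the stream can pull different historical frames.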
Problem

Research questions and friction points this paper is trying to address.

online video understanding
temporal context
memory constraints
streaming video
video reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

online video understanding
hierarchical memory system
redundancy-aware eviction
context-aware retrieval
streaming video reasoning
Zifan Han
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, P. R. China
Hongbo Sun
Institute of Artificial Intelligence (TeleAI), China Telecom, Peking University
Fine-grained visual analysis, multi-modal understanding, machine learning
Jinglin Xu
University of Science and Technology Beijing, Beijing 100083, P. R. China
Canhui Tang
Xi'an Jiaotong University
Computer vision
Yulong Lei
University of Science and Technology Beijing, Beijing 100083, P. R. China
Xuchong Zhang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, P. R. China
Hongbin Sun
Xi'an Jiaotong University
Computer architecture, VLSI circuits
Zhongjiang He
Institute of Artificial Intelligence (TeleAI), China Telecom, 31 Jinrong Ave., Xicheng District, Beijing 100033, P. R. China
Hao Sun
Central China Normal University
Computer vision, hyperspectral image classification, remote sensing scene classification