🤖 AI Summary
Existing VideoQA datasets suffer from two critical limitations: (1) static answer annotations that fail to capture temporal evolution, and (2) the absence of reasoning-process annotations, which hinders interpretability and logical reasoning. To address these, we propose StreamingCoT, the first temporally evolving multimodal Chain-of-Thought (CoT) dataset for streaming video question answering. Our method models dynamic answer evolution via per-second dense video descriptions and time-dependent semantic segmentation; introduces a dynamic hierarchical annotation framework featuring human-validated object state-transition reasoning paths; and integrates keyframe-semantic alignment, LLM-driven state-transition reasoning, and similarity-based temporal segmentation to ensure spatiotemporal awareness and logical coherence. We publicly release the StreamingCoT dataset and an open-source construction toolkit to support research on temporal understanding, complex multi-step reasoning, and model interpretability in dynamic video settings.
📝 Abstract
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning-process annotations restricts model interpretability and logical deduction. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning-chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives reasoning paths based on object state transitions using large language models, and ensures logical coherence through human verification. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT dataset and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
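The abstract mentions fusing per-second dense descriptions into temporally dependent semantic segments via similarity, but does not spell out the algorithm. Below is a minimal illustrative sketch, assuming cosine similarity over per-second description embeddings and a greedy merge against the running segment mean; the function name `fuse_segments` and the `threshold` parameter are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse_segments(embeddings, threshold=0.8):
    """Greedily merge consecutive per-second embeddings into segments.

    A second joins the current segment while its similarity to the
    segment's running mean embedding stays above `threshold`; otherwise
    a new segment starts. Returns inclusive (start_sec, end_sec) pairs.
    """
    segments = []
    start = 0
    seg_mean = np.asarray(embeddings[0], dtype=float)
    count = 1
    for t in range(1, len(embeddings)):
        e = np.asarray(embeddings[t], dtype=float)
        if cosine_sim(seg_mean, e) >= threshold:
            # Semantically continuous: fold this second into the segment.
            seg_mean = (seg_mean * count + e) / (count + 1)
            count += 1
        else:
            # Semantic break: close the segment and open a new one.
            segments.append((start, t - 1))
            start, seg_mean, count = t, e, 1
    segments.append((start, len(embeddings) - 1))
    return segments

# Toy example: seconds 0-1 are near-identical, seconds 2-3 differ sharply.
print(fuse_segments([[1, 0], [1, 0.05], [0, 1], [0, 1]], threshold=0.9))
```

Real description embeddings would come from a sentence or vision-language encoder; the threshold controls how aggressively adjacent seconds are fused into one semantic segment.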