🤖 AI Summary
Existing VideoQA datasets suffer from two critical limitations: (1) static answer annotations that fail to capture temporal evolution, and (2) the absence of reasoning-process annotations, which hinders interpretability and logical reasoning. To address these, we propose StreamingCoT, the first temporally evolving multimodal Chain-of-Thought (CoT) dataset for streaming video question answering. Our method models dynamic answer evolution via per-second dense video descriptions and time-dependent semantic segmentation; introduces a dynamic hierarchical annotation framework featuring human-validated object state-transition reasoning paths; and integrates keyframe-semantic alignment, LLM-driven state-transition reasoning, and similarity-based temporal segmentation to ensure spatiotemporal awareness and logical coherence. We publicly release the StreamingCoT dataset and an open-source construction toolkit to support research on temporal understanding, complex multi-step reasoning, and model interpretability in dynamic video settings.
📝 Abstract
The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning-process annotations restricts model interpretability and logical deduction. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning-chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives reasoning paths based on object state transitions using large language models, and ensures logical coherence through human verification. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. Our StreamingCoT dataset and its construction toolkit can be accessed at https://github.com/Fleeting-hyh/StreamingCoT.
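The abstract mentions fusing per-second dense descriptions into temporally dependent semantic segments via similarity, but does not spell out the algorithm. Below is a minimal illustrative sketch, assuming cosine similarity over per-second description embeddings and a greedy merge against the running segment mean; the function name `fuse_segments` and the `threshold` parameter are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse_segments(embeddings, threshold=0.8):
    """Greedily merge consecutive per-second embeddings into segments.

    A second joins the current segment while its similarity to the
    segment's running mean embedding stays above `threshold`; otherwise
    a new segment starts. Returns inclusive (start_sec, end_sec) pairs.
    """
    segments = []
    start = 0
    seg_mean = np.asarray(embeddings[0], dtype=float)
    count = 1
    for t in range(1, len(embeddings)):
        e = np.asarray(embeddings[t], dtype=float)
        if cosine_sim(seg_mean, e) >= threshold:
            # Semantically continuous: fold this second into the segment.
            seg_mean = (seg_mean * count + e) / (count + 1)
            count += 1
        else:
            # Semantic break: close the segment and open a new one.
            segments.append((start, t - 1))
            start, seg_mean, count = t, e, 1
    segments.append((start, len(embeddings) - 1))
    return segments

# Toy example: seconds 0-1 are near-identical, seconds 2-3 differ sharply.
print(fuse_segments([[1, 0], [1, 0.05], [0, 1], [0, 1]], threshold=0.9))
```

Real description embeddings would come from a sentence or vision-language encoder; the threshold controls how aggressively adjacent seconds are fused into one semantic segment.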