StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

2026-06-04
Citations: 0
Influential: 0

AI Summary
This work addresses the limitations of existing video question-answering methods on complex narratives: they struggle with long-range dependencies, diverse question types, and fine-grained story elements, and large-scale, high-quality datasets are scarce. To overcome these challenges, the authors propose StoryMindv2, a multi-agent collaborative framework that integrates supervisor-guided generation with a multi-reviewer voting mechanism to automatically construct deep video understanding data for both TV series and movies. They introduce StoryVideoQA, the largest deep video understanding dataset to date, comprising over 363K question-answer pairs grounded in 393.2 hours of video content. Additionally, they design PlotTree, a video understanding agent that reorganizes long-range video content into a hierarchical plot structure to support storyline reasoning. Evaluations of 20 state-of-the-art VideoQA methods on the benchmark show that current methods cannot fully maintain long-range character associations or build a coherent understanding of complex storylines.
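To make the idea of a hierarchical plot structure concrete, the following is a minimal sketch and not the authors' PlotTree implementation, whose internals the summary does not describe. The `PlotNode` class, the `build_parent` and `find_segments` helpers, and the toy scenes are all hypothetical. The sketch assumes that each node stores a summary, a time span, and the set of characters appearing beneath it, so that a query about one character can skip subtrees where that character never appears.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PlotNode:
    """Hypothetical node: a story segment with a time span and its characters."""
    summary: str
    start: float  # seconds
    end: float    # seconds
    characters: Set[str] = field(default_factory=set)
    children: List["PlotNode"] = field(default_factory=list)


def build_parent(summary: str, children: List[PlotNode]) -> PlotNode:
    """Merge a non-empty list of child segments into one higher-level node."""
    characters: Set[str] = set()
    for child in children:
        characters |= child.characters
    return PlotNode(
        summary=summary,
        start=min(child.start for child in children),
        end=max(child.end for child in children),
        characters=characters,
        children=list(children),
    )


def find_segments(node: PlotNode, character: str) -> List[PlotNode]:
    """Return leaf segments that involve `character`, pruning unrelated subtrees."""
    if character not in node.characters:
        return []
    if not node.children:
        return [node]
    found: List[PlotNode] = []
    for child in node.children:
        found.extend(find_segments(child, character))
    return found


# Toy story (invented for illustration only).
scene1 = PlotNode("Anna meets Ben", 0, 60, {"Anna", "Ben"})
scene2 = PlotNode("Ben leaves town", 60, 150, {"Ben"})
scene3 = PlotNode("Anna finds the letter", 150, 240, {"Anna"})
act1 = build_parent("Act 1", [scene1, scene2])
act2 = build_parent("Act 2", [scene3])
root = build_parent("Story", [act1, act2])
anna_segments = find_segments(root, "Anna")
```

In this toy tree, a question about Anna only needs the first and third scenes, even though they are separated in time. That is the kind of long-range character association a hierarchical organization is meant to make cheap to recover.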
๐Ÿ“ Abstract
Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these difficulties, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is used to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/
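The abstract names a multi-reviewer voting strategy but does not specify how votes are cast or combined. As a rough illustration only, the sketch below assumes a simple majority rule over independent reviewers. The rule-based reviewer functions stand in for what would presumably be model-based reviewer agents; they, the `threshold` parameter, and the sample QAs are hypothetical.

```python
from typing import Callable, Dict, List

QA = Dict[str, str]
Reviewer = Callable[[QA], bool]


def majority_accepts(qa: QA, reviewers: List[Reviewer], threshold: float = 0.5) -> bool:
    """Accept a QA pair when more than `threshold` of the reviewers approve it."""
    votes = [reviewer(qa) for reviewer in reviewers]
    return sum(votes) / len(votes) > threshold


def filter_qas(candidates: List[QA], reviewers: List[Reviewer],
               threshold: float = 0.5) -> List[QA]:
    """Keep only the candidate QA pairs that pass the vote."""
    return [qa for qa in candidates if majority_accepts(qa, reviewers, threshold)]


# Hypothetical stand-in reviewers (assumed, not from the paper).
def ends_with_question_mark(qa: QA) -> bool:
    return qa["question"].strip().endswith("?")


def has_answer(qa: QA) -> bool:
    return bool(qa["answer"].strip())


def answer_not_leaked(qa: QA) -> bool:
    # Reject QAs whose answer text already appears inside the question.
    return qa["answer"].lower() not in qa["question"].lower()


candidates = [
    {"question": "Why does the detective return to the harbor?",
     "answer": "To find the missing ledger"},
    {"question": "Who is Anna", "answer": ""},
]
reviewers = [ends_with_question_mark, has_answer, answer_not_leaked]
kept = filter_qas(candidates, reviewers)
```

Under these assumptions the first candidate is approved by all three reviewers and kept, while the second is rejected by all three and dropped. A stricter pipeline could raise `threshold` to require near-unanimous agreement.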
Problem

Research questions and friction points this paper is trying to address.

Deep Video Understanding
Video Question Answering
Complex Storylines
Long-range Video Content
Storyline Comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

deep video understanding
multi-agent collaboration
auto-generated dataset
hierarchical plot structure
long-range reasoning
Similar Papers
2024-02-20, International Conference on Machine Learning, Citations: 30
2024-05-22, Annual Meeting of the Association for Computational Linguistics, Citations: 2
Zhengqian Wu
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.; National Engineering Research Center for Multimedia Software, China.; Hubei Key Laboratory of Multimedia and Network Communication Engineering, China.
Zhixian Liu
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
Aodong Chen
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
Jingyang Zhang
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
Ruizhe Li
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.; National Engineering Research Center for Multimedia Software, China.; Hubei Key Laboratory of Multimedia and Network Communication Engineering, China.
Hanlin Ge
School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
Zhongyuan Wang
Wuhan University
Chunxia Xiao
Professor of Computer Science, Wuhan University
Computer Vision, Computer Graphics, Machine Learning
Chao Liang
Professor of Computer Science, Wuhan University
Computer Vision, Pattern Recognition, Multimedia, HCI