A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

📅 2025-01-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Video dialogue tasks suffer from a lack of high-quality benchmarks and reliable evaluation methodologies. Method: This work introduces VDAct, a new benchmark comprising 1,000 complex long-form videos, 3,000 dialogues, and over 30,000 question-answer pairs, and frames video dialogue around event-driven activities. It further designs VDEval, a knowledge-graph-enhanced, session-level evaluation metric that jointly models video content summaries, multimodal context, and dialogue history. Results: Experiments demonstrate that VDEval achieves significantly higher correlation with human judgments than conventional turn-level metrics, and the benchmark exposes critical limitations of current vision foundation models on complex event reasoning. This work establishes a new benchmark, introduces a novel evaluation paradigm, and delivers key diagnostic insights for video understanding and dialogue generation.

๐Ÿ“ Abstract
This paper presents VDAct, a dataset for video-grounded dialogue on event-driven activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities requiring advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct is notably challenging due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.
Problem

Research questions and friction points this paper is trying to address.

Video Dialogue Dataset
Human Evaluation Standard
Video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

VDAct Dataset
VDEval Method
Video Activity Understanding
Wiradee Imrattanatrai
National Institute of Advanced Industrial Science and Technology (AIST)
Masaki Asada
National Institute of Advanced Industrial Science and Technology (AIST)
Kimihiro Hasegawa
Language Technologies Institute, Carnegie Mellon University
Zhi-Qi Cheng
Assistant Professor @ UW | Graduate Faculty | Ex-CMU, Google, Microsoft | Intel & IBM PhD Fellowship
multimedia processing, multimedia understanding, multimodal foundation model
Ken Fukuda
National Institute of Advanced Industrial Science and Technology (AIST)
Teruko Mitamura
Research Professor, Language Technologies Institute, School of Computer Science, Carnegie Mellon
Natural Language Processing, Question Answering, Japanese NLP, Semantics, Events