Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

📅 2023-11-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses structured understanding and cross-domain generalization for long-horizon, multimodal procedural videos, exemplified by extravehicular activities (EVAs) aboard the International Space Station. The authors introduce the first benchmark tailored to real-world space operations, covering long-duration, multimodal procedural video understanding through two core tasks: step recognition and video question answering. To enable zero- or few-shot domain adaptation without fine-tuning, they propose a summary-guided adaptation method that combines lightweight multimodal fusion (vision + speech), temporal action segmentation, and summary-informed reasoning. Experiments reveal substantial performance gaps for existing models on cross-domain long-video understanding; the proposed approach yields up to a 14.2% absolute accuracy improvement without fine-tuning. The benchmark is publicly released, establishing a new standard for multimodal procedural video understanding in aerospace applications.
📝 Abstract
Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g. visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation via summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
Problem

Research questions and friction points this paper is trying to address.

Benchmark for multimodal long-form video understanding in novel domains
Tasks: step recognition and video question answering in spacewalks
Challenges: domain generalization and long temporal context utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal video understanding with visual and speech data
Long-form procedural video analysis for skill acquisition
Summarization technique for domain generalization without fine-tuning
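The "adaptation via summarization" idea above can be sketched in miniature: rather than feeding a long multimodal stream directly to a model, condense each temporal window into a short summary and let a downstream reasoner operate over the compact context. This is a hypothetical illustration, not the paper's implementation; the function names and the naive first-utterance summarizer are placeholders (a real system would call a language model).

```python
def chunk(transcript, window=3):
    """Split a list of utterances into fixed-size temporal windows."""
    return [transcript[i:i + window] for i in range(0, len(transcript), window)]

def naive_summary(window_utterances):
    """Placeholder summarizer: keep the first utterance of each window.
    A real system would summarize the window with a language model."""
    return window_utterances[0]

def summarize_then_answer(transcript, question, answerer):
    """Build a compact context from per-window summaries, then query a
    downstream reasoner (e.g., an off-the-shelf model) without fine-tuning."""
    summaries = [naive_summary(w) for w in chunk(transcript)]
    context = " ".join(summaries)
    return answerer(context, question)

# Toy usage: the "answerer" just checks whether a keyword appears in context.
transcript = [
    "EV1 egresses the airlock.", "Tether check complete.", "Camera on.",
    "EV1 translates to the P6 truss.", "Handrail 23 reached.", "Visor down.",
]
answer = summarize_then_answer(
    transcript, "airlock", lambda ctx, q: q in ctx.lower()
)
```

The design choice this sketch highlights is that summarization shrinks the temporal context before reasoning, which is what lets a frozen model handle long videos in a novel domain.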
Authors
Rohan Myer Krishnan, Brown University
Zitian Tang, Brown University
Zhiqiu Yu, Brown University
Chen Sun, Brown University