Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

📅 2023-11-30
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses structured understanding and cross-domain generalization for long-horizon, multimodal procedural videos, exemplified by extravehicular activities (EVAs) aboard the International Space Station. The authors introduce the first benchmark tailored to real-world space operations, covering long-duration, multimodal procedural video understanding through two core tasks: step recognition and video question answering. To enable zero- or few-shot domain adaptation without fine-tuning, they propose a summary-guided adaptation method that combines lightweight multimodal fusion (vision + speech), temporal action segmentation, and summary-informed reasoning. Experiments reveal substantial performance gaps for existing models on cross-domain long-video understanding; the proposed approach yields up to a 14.2% absolute accuracy improvement without fine-tuning. The benchmark is publicly released, establishing a new standard for multimodal procedural video understanding in aerospace applications.
📝 Abstract
Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to: (1) generalize to novel domains; (2) utilize long temporal context and multimodal (e.g. visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation via summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
Problem

Research questions and friction points this paper is trying to address.

Benchmark for multimodal long-form video understanding in novel domains
Tasks: step recognition and video question answering in spacewalks
Challenges: domain generalization and long temporal context utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal video understanding with visual and speech data
Long-form procedural video analysis for skill acquisition
Summarization technique for domain generalization without fine-tuning
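The "adaptation via summarization" idea above can be sketched in miniature: rather than feeding a long multimodal stream directly to a model, condense each temporal window into a short summary and let a downstream reasoner operate over the compact context. This is a hypothetical illustration, not the paper's implementation; the function names and the naive first-utterance summarizer are placeholders (a real system would call a language model).

```python
def chunk(transcript, window=3):
    """Split a list of utterances into fixed-size temporal windows."""
    return [transcript[i:i + window] for i in range(0, len(transcript), window)]

def naive_summary(window_utterances):
    """Placeholder summarizer: keep the first utterance of each window.
    A real system would summarize the window with a language model."""
    return window_utterances[0]

def summarize_then_answer(transcript, question, answerer):
    """Build a compact context from per-window summaries, then query a
    downstream reasoner (e.g., an off-the-shelf model) without fine-tuning."""
    summaries = [naive_summary(w) for w in chunk(transcript)]
    context = " ".join(summaries)
    return answerer(context, question)

# Toy usage: the "answerer" just checks whether a keyword appears in context.
transcript = [
    "EV1 egresses the airlock.", "Tether check complete.", "Camera on.",
    "EV1 translates to the P6 truss.", "Handrail 23 reached.", "Visor down.",
]
answer = summarize_then_answer(
    transcript, "airlock", lambda ctx, q: q in ctx.lower()
)
```

The design choice this sketch highlights is that summarization shrinks the temporal context before reasoning, which is what lets a frozen model handle long videos in a novel domain.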
Authors
Rohan Myer Krishnan, Brown University
Zitian Tang, Brown University
Zhiqiu Yu, Brown University
Chen Sun, Brown University