🤖 AI Summary
Existing video understanding datasets are predominantly limited to Earth-gravity scenarios, restricting their applicability to safety-critical visual perception in space missions. To address this gap, we introduce MicroG-4M, the first benchmark for human activity understanding under microgravity, comprising 4,759 videos from real space missions and high-fidelity simulations, 50 action classes, 1,238 contextual descriptions, and over 7,000 visual question-answer pairs. It supports three core tasks: fine-grained action recognition, temporal video captioning, and visual question answering, and establishes the first multi-dimensional evaluation framework for microgravity video understanding, jointly assessing spatial localization and semantic reasoning. We provide baselines built on state-of-the-art spatiotemporal modeling, end-to-end captioning, and vision-language alignment architectures; unified benchmarking reveals significant generalization bottlenecks of current models in microgravity environments. The dataset, annotations, and code are fully open-sourced.
📝 Abstract
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, exposing a critical gap for real-world vision systems and posing a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.