Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding datasets are predominantly confined to Earth-gravity scenarios, limiting their applicability to safety-critical visual perception in microgravity space missions. To address this gap, we introduce MicroG-4M, the first benchmark for human activity understanding under microgravity, comprising 4,759 video clips drawn from real space missions and high-fidelity simulations, 50 action classes, 1,238 contextual descriptions, and over 7,000 visual question-answer pairs. It supports three core tasks: fine-grained action recognition, temporal video captioning, and visual question answering. We establish the first multi-dimensional evaluation framework for microgravity video understanding, jointly assessing spatial localization and semantic reasoning, and develop baselines built on spatiotemporal modeling, end-to-end captioning, and vision-language alignment architectures. Unified benchmarking reveals significant generalization bottlenecks in current models under microgravity conditions. The dataset, annotations, and code are fully open-sourced.

📝 Abstract
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.
Problem

Research questions and friction points this paper is trying to address.

Addressing video understanding gaps in microgravity environments
Developing a benchmark for human activity analysis in space
Enhancing domain-robust vision systems for space applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MicroG-4M benchmark for microgravity
Covers 50 actions with 4,759 clips
Supports action recognition, captioning, and VQA
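Since the benchmark frames action recognition as a fine-grained multi-label task, the standard way to score it is mean average precision (mAP): rank each class's predictions by confidence, compute per-class average precision, then average over classes. A minimal sketch of that metric, assuming per-clip confidence scores and binary ground-truth labels (the function names and data are illustrative, not taken from the MicroG-4M codebase):

```python
def average_precision(scores, labels):
    """AP for one action class.

    scores: model confidences, one per clip.
    labels: matching ground truth (1 = action present, 0 = absent).
    """
    # Rank clips from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            # Precision at each rank where a true positive is retrieved.
            precisions.append(hits / rank)
    # Average over positives; 0.0 if the class never occurs.
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_class):
    """mAP over a dict of {class_name: (scores, labels)} pairs."""
    aps = [average_precision(s, l) for s, l in per_class.values()]
    return sum(aps) / len(aps)
```

For example, a class whose two positive clips are ranked first and second gets AP = 1.0, while pushing one positive below a false alarm drops its AP, so mAP directly penalizes the kind of misrankings the benchmark reports as generalization bottlenecks.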