🤖 AI Summary
Existing video understanding datasets are predominantly limited to Earth-gravity scenarios, restricting their applicability to safety-critical visual perception in space missions. To address this gap, we introduce MicroG-4M, the first benchmark for human activity understanding under microgravity, comprising 4,759 videos from real space missions and high-fidelity simulations, 50 action classes, 1,238 contextual descriptions, and over 7,000 visual question-answer pairs. It supports three core tasks: fine-grained action recognition, temporal video captioning, and visual question answering, and establishes the first multi-dimensional evaluation framework for microgravity video understanding, jointly assessing spatial localization and semantic reasoning. We provide baselines built on state-of-the-art spatiotemporal modeling, end-to-end captioning, and vision-language alignment architectures; unified benchmarking reveals significant generalization bottlenecks of current models in microgravity environments. The dataset, annotations, and code are fully open-sourced.
📝 Abstract
Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, exposing a critical gap for real-world vision systems and posing a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.