Multimodal Foundation Model for Cross-Modal Retrieval and Activity Recognition Tasks

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current multimodal foundation models rely predominantly on first-person video and text, limiting their capacity for fine-grained, full-body human activity understanding. This work introduces AURA-MFM, the first four-modal foundation model unifying third-person video, motion capture (MoCap), inertial measurement unit (IMU) sensor data, and text, thereby overcoming single-view constraints. Methodologically, we design a dedicated Transformer-based IMU encoder to model high-frequency temporal signals and propose a multimodal representation alignment mechanism that jointly leverages cross-modal attention and contrastive learning to ensure semantic consistency across modalities. In zero-shot action recognition, AURA-MFM achieves an F1-score of 0.6226 (a 7.3× relative improvement) and an accuracy of 73.20% (a 2.7× relative improvement), significantly outperforming prior art. It also establishes new state-of-the-art results in cross-modal retrieval and recognition. Overall, AURA-MFM provides a scalable, general-purpose multimodal foundation architecture for holistic human activity understanding.
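The contrastive part of the alignment mechanism described above can be pictured with a minimal PyTorch sketch: a symmetric InfoNCE-style loss applied to every pair of modality embeddings so that matching clips attract and mismatched clips repel in one shared space. This is an illustrative sketch under CLIP-style assumptions, not the authors' released code; the function names, the temperature value, and the sum-over-pairs design are assumptions, and the cross-modal attention component is not shown.

```python
# Hedged sketch: pairwise InfoNCE contrastive alignment across four
# modality embeddings. Names, temperature, and pairing scheme are
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss between two (batch, dim) embedding sets.

    Row i of `a` and row i of `b` describe the same clip (positives);
    all other pairings in the batch act as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching rows on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def alignment_loss(video, mocap, imu, text):
    """Average the pairwise loss over all six unordered modality pairs."""
    embs = [video, mocap, imu, text]
    pairs = [(x, y) for i, x in enumerate(embs) for y in embs[i + 1:]]
    return sum(info_nce(x, y) for x, y in pairs) / len(pairs)
```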

πŸ“ Abstract
In recent years, the widespread adoption of wearable devices has highlighted the growing importance of behavior analysis using inertial measurement units (IMUs). While applications span diverse fields such as healthcare and robotics, recent studies have increasingly focused on multimodal analysis in addition to unimodal analysis. Several studies have proposed multimodal foundation models that incorporate first-person video and text data; however, these models still fall short of providing a detailed analysis of full-body human activity. To address this limitation, we propose the Activity Understanding and Representations Alignment Multimodal Foundation Model (AURA-MFM), a foundation model integrating four modalities: third-person video, motion capture, IMU, and text. By incorporating third-person video and motion capture data, the model enables a detailed and multidimensional understanding of human activity that first-person perspectives alone fail to capture. Additionally, a Transformer-based IMU encoder is employed to enhance the model's overall performance. Experimental evaluations on retrieval and activity recognition tasks demonstrate that our model surpasses existing methods. Notably, in zero-shot classification for action recognition, our method achieved significantly higher performance, with an F1-score of 0.6226 and an accuracy of 0.7320, whereas the existing method recorded an F1-score of 0.0747 and an accuracy of 0.1961.
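To make the zero-shot protocol from the abstract concrete: once the modalities share an embedding space, an IMU clip can be classified by embedding candidate action labels as text prompts and picking the nearest one. The sketch below assumes this standard recipe; `imu_encoder`, `text_encoder`, and the prompt template are hypothetical stand-ins, not the paper's API.

```python
# Hedged sketch of zero-shot action recognition via the shared space.
# The encoders and the prompt template are hypothetical stand-ins for
# the trained AURA-MFM components.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(imu_clip, class_names, imu_encoder, text_encoder):
    """Return the action label whose text embedding is closest to the clip."""
    prompts = [f"a person {name}" for name in class_names]   # assumed template
    text_emb = F.normalize(text_encoder(prompts), dim=-1)    # (num_classes, dim)
    imu_emb = F.normalize(imu_encoder(imu_clip), dim=-1)     # (1, dim)
    scores = (imu_emb @ text_emb.t()).squeeze(0)             # cosine similarities
    return class_names[scores.argmax().item()]
```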
Problem

Research questions and friction points this paper is trying to address.

Enhancing human activity analysis with multimodal data integration
Addressing limitations in full-body activity understanding from first-person views
Improving cross-modal retrieval and activity recognition performance (a retrieval sketch follows this list)
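The retrieval task named above has a simple mechanical core once all modalities are aligned: rank gallery items by cosine similarity to the query embedding. A minimal sketch under that assumption, with precomputed embeddings and illustrative names rather than the paper's interface:

```python
# Illustrative cross-modal retrieval over precomputed embeddings,
# e.g. a text query against a gallery of video or IMU clips.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5):
    """Return indices of the top-k gallery items for one query.

    query_emb:   (dim,)   embedding of the query (e.g. a caption)
    gallery_emb: (N, dim) embeddings of the gallery items
    """
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    scores = g @ q                  # cosine similarity per gallery item
    return scores.topk(k).indices
```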
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates third-person video and motion capture
Uses a Transformer-based IMU encoder (see the sketch after this list)
Combines four modalities for activity analysis
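As referenced above, here is a minimal sketch of what a Transformer-based IMU encoder can look like: the raw signal is tokenized into patches and pooled through a CLS token into the shared embedding space. It assumes 6-channel input (triaxial accelerometer plus gyroscope); all hyperparameters and the patch/CLS design are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a Transformer-based IMU encoder; hyperparameters
# and design choices are assumptions, not the authors' architecture.
import torch
import torch.nn as nn

class IMUTransformerEncoder(nn.Module):
    """Patchify a raw IMU signal into tokens and encode with self-attention."""

    def __init__(self, in_channels=6, d_model=256, n_heads=8, n_layers=4,
                 patch_len=16, max_tokens=128, embed_dim=512):
        super().__init__()
        # One token per non-overlapping window of `patch_len` samples.
        self.patch = nn.Conv1d(in_channels, d_model,
                               kernel_size=patch_len, stride=patch_len)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_tokens + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, embed_dim)   # into the shared space

    def forward(self, x):                            # x: (batch, channels, time)
        tokens = self.patch(x).transpose(1, 2)       # (batch, seq, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)
        tokens = tokens + self.pos[:, :tokens.size(1)]  # learned positions
        return self.proj(self.encoder(tokens)[:, 0])    # CLS-token embedding

# Usage: a 2-clip batch of 6-channel signals, 512 samples each.
# emb = IMUTransformerEncoder()(torch.randn(2, 6, 512))  # (2, 512)
```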
Koki Matsuishi, Kyushu Institute of Technology
Kosuke Ukita, Kyushu Institute of Technology
Tsuyoshi Okita, Kyushu Institute of Technology
Generative AI · Deep Learning · Artificial Intelligence · IoT · Natural Language Processing