EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic inconsistency and feature artifacts in first-person scene understanding—caused by dynamic interactions, frequent occlusions, and large viewpoint changes—this paper proposes an open-vocabulary 3D semantic reconstruction framework. Methodologically, it integrates multi-view geometry, instance-level temporal modeling, and language-guided semantic embedding. Key contributions include: (1) a multi-view consistent instance feature aggregation mechanism that mitigates semantic drift induced by viewpoint transitions; and (2) an instance-aware spatio-temporal transient prediction module that combines SAM2-based segmentation and tracking with language-embedded Gaussian Splatting to suppress artifacts from dynamic objects. Evaluated on the ADT dataset, the method achieves an 8.2% improvement in localization accuracy and a 3.7% gain in segmentation mIoU, establishing new state-of-the-art performance for open-vocabulary first-person 3D semantic understanding.

📝 Abstract
Egocentric scenes exhibit frequent occlusions, varied viewpoints, and dynamic interactions compared to typical scene understanding tasks. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with an 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.
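The abstract's multi-view consistent feature aggregation can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each SAM2-tracked instance already has one pooled language feature per view (e.g. a CLIP embedding averaged inside the instance mask) plus a per-view reliability score (e.g. mask area or tracking confidence), and fuses only the most reliable views so that occluded or oblique viewpoints contribute less. The function name and scoring scheme are hypothetical.

```python
import numpy as np

def aggregate_instance_features(view_feats, view_scores, top_k=3):
    """Hypothetical sketch: fuse per-view semantic features of one
    tracked instance into a single multi-view consistent embedding.

    view_feats  : (V, D) array, one language feature per view
                  (assumed, e.g. a CLIP embedding pooled inside the
                  SAM2 mask for this instance).
    view_scores : (V,) per-view reliability score (assumed, e.g.
                  visible mask area or tracking confidence).
    top_k       : number of most reliable views kept; occluded views
                  are dropped, which is one way to "selectively
                  aggregate complementary features across views".
    """
    view_feats = np.asarray(view_feats, dtype=np.float64)
    view_scores = np.asarray(view_scores, dtype=np.float64)

    keep = np.argsort(view_scores)[::-1][:top_k]      # best views first
    w = view_scores[keep] / view_scores[keep].sum()   # normalized weights
    fused = (w[:, None] * view_feats[keep]).sum(axis=0)
    return fused / np.linalg.norm(fused)              # unit-norm embedding
```

A score-weighted top-k average is just one plausible fusion rule; the point is that aggregation happens at the instance level, across tracked views, rather than per pixel, which is what suppresses view-dependent semantic drift.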
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-view semantic inconsistencies in egocentric scenes.
Reduces artifacts from dynamic objects in semantic feature modeling.
Improves spatial integrity and temporal continuity in scene predictions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-embedded 3D Gaussian Splatting framework
Multi-view consistent instance feature aggregation
Instance-aware spatial-temporal transient prediction module
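The transient prediction idea above can be sketched at a similarly coarse level. The paper's module operates on language-embedded Gaussians; the toy version below only captures the instance-aware part: given SAM2 tracks (assumed input, here reduced to normalized mask centroids per frame), an instance is flagged as transient when it moves noticeably between frames, and that single per-instance decision is shared by every frame of the track, which is one way to obtain the spatial integrity and temporal continuity the paper targets. The function name and threshold are illustrative assumptions.

```python
import numpy as np

def predict_transients(tracks, motion_thresh=0.05):
    """Hypothetical sketch of instance-level transient prediction.

    tracks : dict mapping instance_id -> (T, 2) array of normalized
             mask centroids over T tracked frames (assumed to come
             from SAM2 segmentation + tracking).
    Returns: dict instance_id -> bool, True if the instance is
             predicted transient (dynamic) in *all* its frames.
    """
    transient = {}
    for inst_id, centroids in tracks.items():
        c = np.asarray(centroids, dtype=np.float64)
        if len(c) < 2:                 # a single observation: assume static
            transient[inst_id] = False
            continue
        # per-frame centroid displacement along the track
        step = np.linalg.norm(np.diff(c, axis=0), axis=1)
        # one decision per instance, propagated to every frame of the
        # track (temporal continuity), not made per pixel or per frame
        transient[inst_id] = bool(step.max() > motion_thresh)
    return transient
```

In the actual method the flagged instances would then be excluded or down-weighted when optimizing the language-embedded Gaussians, so that dynamic distractors do not leave semantic artifacts in the reconstruction.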