🤖 AI Summary
Existing 3D multimodal large language models (3D-MLLMs) are largely object-centric and struggle to capture the fine-grained part structures needed for embodied interaction. This work proposes PAR3D, a unified part-aware 3D-MLLM framework that understands, reasons about, and grounds both objects and their constituent parts in 3D scenes. It introduces ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions, enriches 3D visual representations with part-level semantics (Part-Aware 3D Representation Learning), and grounds part targets through hierarchical object-part queries (Hierarchical Segmentation Query Generation). Experiments show substantial improvements on part-level question answering and referring segmentation, along with strong performance on object-level vision-language tasks.
📝 Abstract
Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.