🤖 AI Summary
3D Scene Question Answering (3D SQA) faces core challenges including dataset heterogeneity, inefficient multimodal fusion, and inconsistent task formulation. Method: This paper presents the first systematic survey of 3D SQA, establishing a standardized analytical framework covering datasets, methodologies, and evaluation protocols. It introduces the first taxonomy of 3D SQA methods, unifying diverse 3D representations (point clouds, voxels, and NeRF) and integrating 3D visual understanding, natural language processing, and large language model techniques such as instruction tuning and zero-shot transfer. Contribution/Results: The survey identifies three critical bottlenecks: insufficient data standardization, weak cross-modal alignment, and the absence of embodied tasks. It proposes three corresponding future directions: unified cross-benchmark datasets, explicit cross-modal alignment mechanisms, and embodied-task extensions. This work provides both theoretical foundations and practical paradigms for semantic understanding and interactive reasoning in 3D environments.
📝 Abstract
3D Scene Question Answering (3D SQA) is an interdisciplinary task that integrates 3D visual perception and natural language processing, enabling intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal models have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress makes unified analysis and comparison across datasets and baselines difficult. This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics while highlighting critical challenges and future opportunities in dataset standardization, multimodal fusion, and task design.