Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

📅 2026-06-01

🤖 AI Summary
This work addresses the challenge of preserving task-relevant 3D details under a limited input budget in zero-shot 3D scene understanding, with the goal of improving visual question answering. The authors propose KeyVT, a method that scores view importance jointly from semantic content and geometric position, then uses an optimal transport framework to select representative, non-redundant visual tokens across the chosen views. The result is a spatially consistent, task-aware context for the model. By combining camera-parameter-aware pixel features, multi-view sampling, and a pre-trained vision-language model, KeyVT significantly outperforms existing training-free approaches on three mainstream 3D visual question answering benchmarks and achieves performance comparable to methods that require task-specific training.
📝 Abstract
Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose KeyVT, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.
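The token-level idea in the abstract (pick a small set of key tokens whose distribution is close, in OT distance, to the distribution of all view tokens) can be illustrated with a toy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: uniform weights on both distributions, a squared Euclidean cost on unit-normalized embeddings, an entropy-regularized (Sinkhorn) solver, and a greedy search over candidates. The function names `sinkhorn_cost` and `select_key_tokens` are hypothetical.

```python
import numpy as np


def sinkhorn_cost(X, Y, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between uniform distributions on rows of X and Y.

    Assumption: squared Euclidean ground cost, uniform masses on both sides.
    """
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise cost matrix
    a = np.full(len(X), 1.0 / len(X))  # mass on each view token
    b = np.full(len(Y), 1.0 / len(Y))  # mass on each key token
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):           # standard Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]    # approximate transport plan
    return float((P * C).sum())


def select_key_tokens(tokens, k, eps=0.1):
    """Greedily pick k token indices that minimize the OT cost to all tokens.

    Illustrative only: each step solves one Sinkhorn problem per candidate,
    which is far too slow for real token counts.
    """
    tokens = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        best_idx, best_cost = None, np.inf
        for i in range(len(tokens)):
            if i in chosen:
                continue
            cost = sinkhorn_cost(tokens, tokens[chosen + [i]], eps)
            if cost < best_cost:
                best_idx, best_cost = i, cost
        chosen.append(best_idx)
    return chosen


# Toy data: two tight clusters of five "tokens" each in a 2D embedding space.
rng = np.random.default_rng(0)
cluster_a = np.array([1.0, 0.0]) + 0.01 * rng.normal(size=(5, 2))
cluster_b = np.array([0.0, 1.0]) + 0.01 * rng.normal(size=(5, 2))
toy_tokens = np.vstack([cluster_a, cluster_b])
key_indices = select_key_tokens(toy_tokens, k=2)
```

On this toy input, a second key token drawn from the same cluster as the first would force half of the mass to travel across clusters, so the greedy search ends up with one key token per cluster. That is the non-redundancy property the abstract describes, shown here only in a simplified setting.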
Problem

Research questions and friction points this paper is trying to address.

Zero-Shot 3D Question Answering
Vision-Language Models
3D Scene Understanding
Input Context Optimization
View Selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot 3D QA
Hierarchical View Selection
Optimal Transport
Vision-Language Models
Token Compression