🤖 AI Summary
Existing 2D vision-language models (VLMs) lack inherent 3D spatial reasoning capability. Method: This paper introduces a voxel slicing representation: partitioning a 3D voxel grid into 2D slices perpendicular to a principal axis, feeding each slice to a standard 2D image encoder (e.g., CLIP or Flamingo), and jointly optimizing cross-slice feature aggregation and multimodal alignment to enable end-to-end modeling of “voxel semantics” (object identity, color, and spatial position). Contribution/Results: The approach eliminates the need for complex, parameter-heavy 3D architectures and is the first to enable pre-trained 2D VLMs to directly interpret 3D structure via lightweight slicing. Evaluated on ScanNet and NYUv2, it achieves substantial gains in 3D scene-understanding accuracy while accelerating inference by 3.2× over dedicated 3D-VLMs and reducing model parameters by 87%, establishing a new paradigm for efficient 3D-to-2D cross-modal understanding.
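The cross-slice aggregation idea can be sketched minimally as follows. The `toy_encoder` (a frozen random projection) and simple mean pooling are illustrative placeholders for the paper's actual components, which would be a pre-trained 2D VLM encoder (e.g., CLIP's ViT) and a learned aggregation module:

```python
import numpy as np

def toy_encoder(img: np.ndarray) -> np.ndarray:
    """Stand-in for a pre-trained 2D image encoder (e.g., CLIP's ViT).
    Maps one (H, W, C) slice to a fixed-size feature vector."""
    rng = np.random.default_rng(0)  # frozen weights: same projection every call
    proj = rng.standard_normal((img.size, 64)).astype(np.float32)
    return img.reshape(-1) @ proj

def aggregate_slices(voxels: np.ndarray) -> np.ndarray:
    """Encode each 2D slice along the first axis and mean-pool the
    per-slice features into a single volume-level embedding."""
    feats = np.stack([toy_encoder(s) for s in voxels])  # (D, 64)
    return feats.mean(axis=0)

# Toy 8x16x16 voxel grid with 3 color channels.
grid = np.random.rand(8, 16, 16, 3).astype(np.float32)
embedding = aggregate_slices(grid)  # one 64-dim feature for the whole volume
```

In the actual method, this pooled (or otherwise aggregated) representation would then be aligned with the language component's semantic concepts during training.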
📝 Abstract
Comprehending 3D environments is vital for intelligent systems in domains such as robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning from them remains challenging. This paper proposes a novel approach that utilizes a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and to correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
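The slicing step itself is a simple tensor operation. A minimal NumPy sketch, assuming a dense `(D, H, W, C)` grid and treating the first axis as the Z-axis (the grid shape and channel convention are illustrative, not the paper's exact preprocessing):

```python
import numpy as np

def slice_voxels(voxels: np.ndarray, axis: int = 0) -> np.ndarray:
    """Partition a dense voxel grid into a stack of 2D slices.

    voxels: (D, H, W, C) grid, e.g. occupancy plus color channels.
    axis: which spatial axis to slice along (0 ~ the Z-axis,
    analogous to CT scan slices).
    Returns an array whose first dimension indexes the slices, so
    result[i] is one 2D "image" ready for a 2D image encoder.
    """
    # Move the chosen axis to the front; each result[i] is then a 2D slice.
    return np.moveaxis(voxels, axis, 0)

# Toy 8x16x16 grid with 3 color channels.
grid = np.random.rand(8, 16, 16, 3).astype(np.float32)
z_slices = slice_voxels(grid, axis=0)  # 8 slices of shape (16, 16, 3)
```

Each slice would then be resized and normalized to the VLM encoder's expected input format before being fed in sequentially.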