VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 2D vision-language models (VLMs) lack inherent 3D spatial reasoning capability. Method: The paper introduces a voxel-slicing representation: a 3D voxel grid is partitioned into 2D slices along a primary axis (e.g., the Z-axis), each slice is fed to a standard 2D image encoder (e.g., CLIP or Flamingo), and cross-slice feature aggregation is jointly optimized with multimodal alignment to enable end-to-end modeling of "voxel semantics" (object identity, color, and spatial position). Contribution/Results: The approach eliminates the need for complex, parameter-heavy 3D architectures and is the first to enable pre-trained 2D VLMs to interpret 3D structure directly via lightweight slicing. Evaluated on ScanNet and NYUv2, it achieves substantial gains in 3D scene-understanding accuracy while accelerating inference by 3.2× over dedicated 3D-VLMs and reducing model parameters by 87%, establishing a new paradigm for efficient 3D-to-2D cross-modal understanding.
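As a concrete illustration of the slicing step, here is a minimal sketch in Python. The (X, Y, Z, 3) RGB grid layout and the slice_voxels helper are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def slice_voxels(voxel_grid: np.ndarray, axis: int = 2) -> list[np.ndarray]:
    """Partition a colored voxel grid into 2D slices along one axis.

    voxel_grid: (X, Y, Z, 3) array of RGB values; empty voxels are zeros.
    Returns a list of (H, W, 3) slice images, CT-scan style.
    """
    return [np.take(voxel_grid, i, axis=axis)
            for i in range(voxel_grid.shape[axis])]

# Example: a 32^3 grid containing a red cube in one corner.
grid = np.zeros((32, 32, 32, 3), dtype=np.float32)
grid[4:12, 4:12, 4:12] = [1.0, 0.0, 0.0]
z_slices = slice_voxels(grid, axis=2)  # 32 slices, each of shape (32, 32, 3)
```

Each slice is then formatted like an ordinary RGB image, so a frozen 2D encoder can consume it without architectural changes.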

📝 Abstract
Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract "voxel semantics" (object identity, color, and location) from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
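The cross-slice aggregation might be realized as in the sketch below: per-slice embeddings from a frozen 2D image encoder are fused by a small transformer and pooled into one scene-level feature. The SliceAggregator module, the linear encoder stand-in, the 512-dim width, and the mean-pooled readout are all illustrative assumptions; the abstract does not pin down these details.

```python
import torch
import torch.nn as nn

class SliceAggregator(nn.Module):
    """Fuse per-slice embeddings into a single 3D scene representation."""

    def __init__(self, dim: int = 512, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, slice_embeds: torch.Tensor) -> torch.Tensor:
        # slice_embeds: (batch, num_slices, dim), one embedding per 2D slice
        fused = self.fuser(slice_embeds)  # attention across slices
        return fused.mean(dim=1)          # pooled scene-level feature

# Stand-in for a frozen 2D VLM image encoder (e.g., CLIP's vision tower).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 512))

slices = torch.rand(32, 3, 32, 32)         # 32 slices as RGB images
embeds = encoder(slices).unsqueeze(0)      # (1, 32, 512)
scene_feature = SliceAggregator()(embeds)  # (1, 512)
```

In practice the stand-in encoder would be replaced by the pre-trained VLM's image tower, kept frozen or lightly tuned while the aggregation and alignment layers learn.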
Problem

Research questions and friction points this paper is trying to address.

Enhancing 3D understanding in 2D vision-language models
Extracting semantic meaning from voxel representations
Leveraging 2D VLMs for efficient 3D spatial comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses 2D Vision-Language Model for voxel semantics
Slices voxel space along primary axis for processing
Leverages pre-trained 2D VLMs for 3D understanding (a toy alignment sketch follows this list)
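Putting the pieces together, a CLIP-style scene-to-text alignment over the fused feature could look like this toy sketch; the text-embedding stub and cosine-similarity readout are assumptions, not the paper's specified objective.

```python
import torch
import torch.nn.functional as F

# Hypothetical text-side embedding, e.g. from a VLM's text tower,
# for a query such as "a red cube in the corner".
text_embed = torch.rand(1, 512)

# scene_feature comes from the SliceAggregator sketch above.
scene_feature = torch.rand(1, 512)

# CLIP-style alignment: cosine similarity between scene and text features.
score = F.cosine_similarity(scene_feature, text_embed)
print(f"scene-text alignment score: {score.item():.3f}")
```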
👥 Authors
Alan Dao (AI Researcher, Artificial Intelligence)
Norapat Buppodom (Menlo Research)