LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

career value

168K/year

🤖 AI Summary

This work addresses open-vocabulary, text-driven 3D instance segmentation. Methodologically, it proposes a multi-view consistent 3D mask generation framework that couples a frozen vision-language segmentation model (LISA) with a geometry-aware adapter layer, and introduces a differentiable reprojection loss—requiring no 3D text annotations—to enforce cross-view mask consistency. Leveraging LoRA fine-tuning of only 11.6M parameters, the method exploits RGB-D sequences and camera poses to impose geometric constraints and outputs RGBA prompts compatible with Gaussian splatting or textured mesh reconstruction. Contributions include: (i) the first zero-shot transfer approach enabling multi-view semantically consistent 3D segmentation; (ii) significant gains in open-vocabulary understanding—achieving up to +15.6 points in language-to-3D accuracy over single-view baselines on ScanRefer and Nr3D; and (iii) strong efficiency, data economy, and cross-category generalization.

Technology Category

Application Category

📝 Abstract

Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.

Problem

Research questions and friction points this paper is trying to address.

Lifts 2D language-image segmentation to 3D reconstruction

Ensures multi-view consistency without 3D-text supervision

Enables modular, data-efficient language-guided 3D content creation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework lifting language-image segmentation to 3D

Uses geometry-aware LoRA layers and frozen SAM-3D reconstructor

Enforces multi-view consistency via differentiable reprojection loss

🔎 Similar Papers

Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model