SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering

📅 2026-03-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency in multi-view 3D visual-language models for question answering, where redundant visual tokens hinder inference speed. Existing token pruning methods struggle to simultaneously preserve semantically critical information and ensure adequate 3D spatial coverage. To overcome this limitation, the paper introduces the first visual token pruning framework that jointly leverages explicit semantic importance and 3D geometric structure. The approach identifies semantically salient tokens through attention mechanisms and incorporates spatially diverse tokens based on 3D geometric distances, thereby achieving a synergistic optimization between retaining semantic evidence and maintaining scene coverage. Evaluated on ScanQA and OpenEQA benchmarks, the method reduces visual tokens by 91% and inference latency by 86%, while sustaining competitive 3D question-answering accuracy.
📝 Abstract
Vision-language models (VLMs) have been widely adopted for 3D question answering (3D QA). In typical pipelines, visual tokens extracted from multiple viewpoints are concatenated with language tokens and jointly processed by a large language model (LLM) for inference. However, aggregating multi-view observations inevitably introduces severe token redundancy, leading to an overly large visual token set that significantly hinders inference efficiency under constrained token budgets. Visual token pruning has emerged as a prevalent strategy to address this issue. Nevertheless, most existing pruners are primarily tailored to 2D inputs or rely on indirect geometric cues, which limits their ability to explicitly retain semantically critical objects and maintain sufficient spatial coverage for robust 3D reasoning. In this paper, we propose SeGPruner, a semantic-aware and geometry-guided token reduction framework for efficient 3D QA with multi-view images. Specifically, SeGPruner first preserves semantically salient tokens through an attention-based importance module (Saliency-aware Token Selector), ensuring that object-critical evidence is retained. It then complements these tokens with spatially diverse ones via a geometry-guided selector (Geometry-aware Token Diversifier), which jointly considers semantic relevance and 3D geometric distance. This cooperation between saliency preservation and geometry-guided diversification balances object-level evidence and global scene coverage under aggressive token reduction. Extensive experiments on ScanQA and OpenEQA demonstrate that SeGPruner substantially improves inference efficiency, reducing the visual token budget by 91% and inference latency by 86%, while maintaining competitive performance in 3D reasoning tasks.
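The two-stage selection the abstract describes, first keeping the top attention-scored tokens (Saliency-aware Token Selector), then greedily adding tokens that balance semantic relevance against 3D distance to the already-kept set (Geometry-aware Token Diversifier), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the `alpha` weighting, and the farthest-point-style combination rule are assumptions inferred from the abstract.

```python
import numpy as np

def segpruner_sketch(attn_scores, coords, k_salient, k_diverse, alpha=0.5):
    """Hypothetical sketch of SeGPruner's two-stage token selection.

    attn_scores: (N,)   attention-based importance per visual token
    coords:      (N, 3) 3D positions associated with the tokens
    Returns indices of the kept tokens.
    """
    n = len(attn_scores)

    # Stage 1 (Saliency-aware Token Selector): keep the top-k
    # semantically salient tokens by attention score.
    kept = list(np.argsort(attn_scores)[::-1][:k_salient])

    # Stage 2 (Geometry-aware Token Diversifier): greedily add tokens
    # scoring high on a mix of semantic relevance and 3D distance to the
    # kept set (a farthest-point-sampling-style rule; the exact
    # criterion and weighting are assumed here).
    remaining = [i for i in range(n) if i not in set(kept)]
    for _ in range(k_diverse):
        if not remaining:
            break
        # Distance from each candidate to its nearest already-kept token.
        dist = np.array([
            np.min(np.linalg.norm(coords[kept] - coords[i], axis=1))
            for i in remaining
        ])
        score = alpha * attn_scores[remaining] + (1.0 - alpha) * dist
        kept.append(remaining.pop(int(np.argmax(score))))
    return np.array(kept)
```

Under this reading, the 91% token reduction corresponds to choosing `k_salient + k_diverse` to be roughly 9% of the original multi-view token count.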
Problem

Research questions and friction points this paper is trying to address.

3D question answering
visual token pruning
semantic-geometric representation
multi-view redundancy
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-aware pruning
geometry-guided selection
visual token reduction
3D question answering
multi-view reasoning
Wenli Li
Shanghai University, Shanghai, China
Kai Zhao
Shanghai University, Shanghai, China
Haoran Jiang
Shanghai University, Shanghai, China
Enquan Yang
Shanghai University, Shanghai, China
Yi Su
Shanghai University, Shanghai, China
Dan Zeng
Sun Yat-sen University
Biometrics · computer vision · deep learning