EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

📅 2026-06-07
🤖 AI Summary
This work addresses key limitations in open-vocabulary 3D panoptic segmentation, namely reliance on preprocessing pipelines, error accumulation, and inconsistency between semantic and instance predictions. It introduces EPS3D, an end-to-end feed-forward framework that directly predicts 3D-aware semantic and instance features from multi-view images. The core contribution is a mutual enhancement module that works in two directions: Ins2Sem aligns semantics within each instance, and Sem2Ins refines instance features with semantic guidance, so that consistency between the two representations is modeled explicitly. Trained with a distillation-based strategy on diverse 3D scenes, the method outperforms state-of-the-art baselines on two benchmarks, including a +13% gain in semantic mIoU on Replica. Inference takes about one second per scene, which makes the method practical for applications such as robotic manipulation and 3D scene editing.
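For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how semantic mIoU (mean intersection-over-union) is commonly computed from per-point class labels. The function name `mean_iou` and the toy labels are illustrative assumptions; this is the standard metric definition, not the paper's evaluation code, and the benchmark protocol on Replica may differ in details such as which classes are counted.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Points labeled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points labeled c in at least one of the two.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # classes absent from both are skipped (an assumed convention)
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with hypothetical labels for six points and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is approximately 0.5.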
📝 Abstract
This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods that rely on additional preprocessing, we design an end-to-end architecture, trained with a distillation-based strategy on diverse 3D scenes, that predicts 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce the inherent consistency between semantics and instances. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. EPS3D outperforms state-of-the-art (SOTA) baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks such as robotic manipulation and 3D scene editing.
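To make the idea of "aligning semantics within instances" concrete, here is a minimal sketch that replaces each point's semantic feature with the mean feature of its instance. Everything in it is an assumption for illustration: the function name `pool_semantics_within_instances`, the use of plain mean pooling, and the list-based feature layout are hypothetical. The abstract does not specify how Ins2Sem is implemented, so this should be read as one simple way to enforce per-instance semantic agreement, not as the authors' module.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def pool_semantics_within_instances(
    sem_feats: Sequence[Sequence[float]],
    instance_ids: Sequence[Hashable],
) -> List[List[float]]:
    """Give every point the mean semantic feature of the instance it belongs to."""
    if len(sem_feats) != len(instance_ids):
        raise ValueError("sem_feats and instance_ids must have the same length")

    sums = {}                  # instance id -> running sum of feature vectors
    counts = defaultdict(int)  # instance id -> number of points seen
    for feat, inst in zip(sem_feats, instance_ids):
        if inst not in sums:
            sums[inst] = [0.0] * len(feat)
        for k, value in enumerate(feat):
            sums[inst][k] += value
        counts[inst] += 1

    # Assumed pooling rule: simple average over the points of each instance.
    return [[s / counts[inst] for s in sums[inst]] for inst in instance_ids]


# Hypothetical input: three points, two feature channels, two instances.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ids = [0, 0, 1]
print(pool_semantics_within_instances(feats, ids))
```

After pooling, the two points of instance 0 share the same semantic feature, which is the kind of within-instance agreement the abstract describes. A learned module would presumably use the instance features themselves as soft grouping weights, but that detail is not given in the source.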
Problem

Research questions and friction points this paper is trying to address.

3D panoptic segmentation
open-vocabulary
3D consistency
error accumulation
scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end
3D panoptic segmentation
mutual enhancement
open-vocabulary
feed-forward
Authors
Runsong Zhu
The Chinese University of Hong Kong, HK SAR, China; Hong Kong Centre for Logistics Robotics, HK SAR, China
Jiaxin Guo
The Chinese University of Hong Kong, HK SAR, China; Hong Kong Centre for Logistics Robotics, HK SAR, China
Xiaoyang Guo
Florida State University
Statistical Shape Analysis, Graph, Computer Vision, Machine Learning
Zhengzhe Liu
Lingnan University
Computer Vision, 3D Generation, Computer Graphics, AI Agent, AI4Sci
Ka-Hei Hui
The Chinese University of Hong Kong
Computer Graphics, 3D Vision, Computational Assembly
Wei Yin
Staff Research Scientist, Horizon Robotics
World Model, Generative AI, Physical AI
Kai Chen
The Chinese University of Hong Kong, HK SAR, China; Hong Kong Centre for Logistics Robotics, HK SAR, China
Wei Chen
HKUST
Computer Vision, Vision-Language
Weiqiang Ren
Horizon Robotics, China
Yunhui Liu
Nanjing University
Graph Machine Learning
Pheng-Ann Heng
The Chinese University of Hong Kong, HK SAR, China; Hong Kong Centre for Logistics Robotics, HK SAR, China
Chi-Wing Fu
The Chinese University of Hong Kong, HK SAR, China; Hong Kong Centre for Logistics Robotics, HK SAR, China