🤖 AI Summary
This work addresses the computational and memory inefficiencies of existing offline methods on long image sequences, which repeatedly perform global processing over historical frames, so runtime and GPU memory grow rapidly with sequence length and online deployment becomes impractical. To overcome this, we propose a strictly causal, incremental 3D Gaussian semantic field framework that introduces, for the first time, a geometry-semantics decoupled dual-backbone architecture. Our approach integrates query-driven decoding, causal Gaussian updates, a lightweight instance memory, and query-level contrastive alignment to enable online joint reconstruction and semantic understanding without relying on future frames. The method matches or exceeds strong offline baselines while stably handling sequences of more than 1,000 frames, with significantly slower growth in runtime and memory usage; typical offline approaches, by contrast, often exhaust GPU memory at around 80 frames.
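The "lightweight instance memory" mentioned above can be pictured as a running bank of identity embeddings that each new frame's queries are matched against, so association stays strictly causal. The sketch below is a hypothetical illustration, not the authors' implementation: the greedy cosine-similarity matching, the `sim_threshold`, and the EMA update are all assumptions.

```python
import numpy as np

class InstanceMemory:
    """Minimal sketch of a lightweight instance memory for online association.

    Hypothetical: each incoming query embedding is greedily matched to the
    most similar stored instance (cosine similarity); matched entries are
    refreshed with an exponential moving average, unmatched queries open a
    new slot. Thresholds and update rule are assumptions, not the paper's.
    """

    def __init__(self, sim_threshold=0.5, momentum=0.9):
        self.embeddings = []            # one running embedding per instance
        self.sim_threshold = sim_threshold
        self.momentum = momentum

    @staticmethod
    def _cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def associate(self, query_emb):
        """Return an instance id for this query, creating one if no match."""
        best_id, best_sim = -1, self.sim_threshold
        for i, mem in enumerate(self.embeddings):
            sim = self._cosine(query_emb, mem)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id == -1:
            # unseen instance: allocate a new memory slot
            self.embeddings.append(query_emb.copy())
            return len(self.embeddings) - 1
        # matched: EMA update keeps the memory stable over the stream
        m = self.momentum
        self.embeddings[best_id] = m * self.embeddings[best_id] + (1 - m) * query_emb
        return best_id
```

Because the memory only ever consumes the current frame's queries, the cost per frame depends on the number of tracked instances, not on the number of past frames, which is the property the summary highlights.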
📝 Abstract
Existing offline feed-forward methods for joint scene understanding and reconstruction on long image streams repeatedly perform global computation over an ever-growing set of past observations, so runtime and GPU memory grow rapidly with sequence length, limiting scalability. We propose Streaming Semantic Gaussian Splatting (S2GS), a strictly causal, incremental 3D Gaussian semantic field framework: it uses no future frames and continuously updates scene geometry, appearance, and instance-level semantics without reprocessing historical frames, enabling scalable online joint reconstruction and understanding. S2GS adopts a geometry-semantics decoupled dual-backbone design: the geometry branch performs causal modeling to drive incremental Gaussian updates, while the semantic branch leverages a 2D vision foundation model and a query-driven decoder to predict segmentation masks and identity embeddings, further stabilized by query-level contrastive alignment and lightweight online association against an instance memory. Experiments show that S2GS matches or outperforms strong offline baselines on joint reconstruction-and-understanding benchmarks while significantly improving long-horizon scalability: it processes 1,000+ frames with much slower growth in runtime and GPU memory, whereas offline global-processing baselines typically run out of memory at around 80 frames under the same setting.
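The "query-level contrastive alignment" used to stabilize identity embeddings is, in standard form, an InfoNCE-style loss that pulls matched queries from consecutive frames together and pushes unmatched pairs apart. The paper's exact loss is not given here; the following is a sketch of the conventional formulation, where the row pairing and the `temperature` value are assumptions.

```python
import numpy as np

def query_contrastive_loss(queries_t, queries_tp1, temperature=0.1):
    """Sketch of a query-level contrastive (InfoNCE-style) alignment loss.

    Assumption: row i of `queries_t` and row i of `queries_tp1` are identity
    embeddings of the same instance in consecutive frames (positives); all
    other cross-frame pairs serve as negatives.
    """
    # L2-normalize so the dot product equals cosine similarity
    a = queries_t / np.linalg.norm(queries_t, axis=1, keepdims=True)
    b = queries_tp1 / np.linalg.norm(queries_tp1, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    # numerically stable log-softmax with the diagonal as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Driving this loss down makes each query's embedding most similar to its own instance in the next frame, which is what lets the lightweight memory association remain reliable over long streams.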