ZipSplat: Fewer Gaussians, Better Splats

📅 2026-06-03

🤖 AI Summary
This work addresses an inefficiency in existing feed-forward 3D Gaussian Splatting methods, which tie the number of Gaussians to image resolution rather than to scene complexity. The authors propose ZipSplat, a token-based feed-forward reconstruction framework that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, k-means clustering compresses them into a compact set of scene tokens, cross- and self-attention refine those tokens, and a lightweight MLP decodes each one into a group of Gaussians with unconstrained 3D positions. The method requires neither ground-truth camera poses nor intrinsics, and because clustering is applied at inference, a single trained model can span the quality-efficiency trade-off without retraining. It sets a new state of the art on DL3DV and RealEstate10K, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively, while using roughly 6× fewer Gaussians than pixel-aligned methods, and it generalizes zero-shot to Mip-NeRF360 and ScanNet++.
📝 Abstract
Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ~6× fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
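The token-compression step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Lloyd-style k-means in NumPy on random stand-in features, and the names `kmeans_compress`, `gaussian_budget`, and `gaussians_per_token`, as well as the token count, feature dimension, and the value 16, are assumptions chosen only for illustration. The attention refinement and the MLP decoder are omitted. The sketch shows only how the cluster count `k`, picked at inference, sets the number of scene tokens and therefore the Gaussian budget.

```python
import numpy as np


def kmeans_compress(tokens, k, iters=10, seed=0):
    """Compress (N, D) dense tokens into k centroids via plain Lloyd iterations.

    Returns the (k, D) centroids and the (N,) index of each token's nearest centroid.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input tokens.
    centroids = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every token to every centroid: shape (N, k).
        d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    # Recompute assignments so they match the returned centroids.
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)


def gaussian_budget(num_scene_tokens, gaussians_per_token):
    """Each scene token decodes to a fixed-size group of Gaussians (assumed size)."""
    return num_scene_tokens * gaussians_per_token


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for backbone output: 1024 dense tokens with 16 features each.
    dense_tokens = rng.normal(size=(1024, 16)).astype(np.float32)
    for k in (64, 256):  # k is chosen at inference, no retraining involved
        scene_tokens, assign = kmeans_compress(dense_tokens, k)
        print(k, scene_tokens.shape, gaussian_budget(k, gaussians_per_token=16))
```

Under these assumptions, changing `k` between runs changes the Gaussian count while the rest of the (hypothetical) model stays fixed, which mirrors the abstract's claim that one trained model spans the quality-efficiency curve.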
Problem

Research questions and friction points this paper is trying to address.

3D Gaussian Splatting
scene representation
representation efficiency
pixel-aligned reconstruction
geometric complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Splatting
token-based representation
pose-free 3D reconstruction
k-means clustering
zero-shot generalization