🤖 AI Summary
This work addresses a limitation of existing 3D Gaussian Splatting (3DGS) methods: they rely on pixel-level losses, which often yield blurry renderings and do not account for human perceptual quality. To bridge this gap, we present the first large-scale human subjective study of 3DGS and introduce WD-R, a regularized version of the Wasserstein Distortion loss that can directly replace the original training loss. Without increasing the number of Gaussians, WD-R recovers finer texture detail and achieves state-of-the-art scores on the perceptual metrics LPIPS, DISTS, and FID. In the human study, raters prefer WD-R reconstructions by factors of 1.5× to 3.6×, depending on the baseline and framework. Applied to 3DGS compression, WD-R yields approximately 50% bitrate savings at comparable perceptual metric performance.
📝 Abstract
Although their output is ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than $2.3\times$ over the original 3DGS loss, and $1.5\times$ over the current best method, Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets. It generalizes to recent frameworks such as Mip-Splatting and Scaffold-GS: replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and the resulting reconstructions are preferred by human raters $1.8\times$ and $3.6\times$, respectively. We also find that this carries over to the task of 3DGS scene compression, with $\approx 50\%$ bitrate savings for comparable perceptual metric performance.