🤖 AI Summary
To address CUDA kernel load imbalance and low rendering efficiency in 3D Gaussian Splatting (3DGS) training—caused by non-uniform pixel-to-Gaussian assignments—this paper proposes a Gaussian-granularity parallel rendering framework with fine-grained dynamic scheduling. Our method introduces three key innovations: (1) the first Gaussian-level parallel rendering mechanism; (2) SM-level dynamic load mapping coupled with low-divergence intra-warp parallelization; and (3) fine-grained tiling-based scheduling with runtime load-aware adaptive kernel switching. Experiments demonstrate up to a 7.52× speedup in forward-rendering CUDA kernel performance, significantly reducing per-iteration training time. The approach effectively mitigates GPU resource idleness and long-tail latency, providing an efficient, scalable foundation for real-time 3DGS training.
📝 Abstract
3D Gaussian Splatting (3DGS) is increasingly attracting attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains a time-intensive task, especially in load-imbalanced scenarios where workload diversity among pixels and Gaussian spheres degrades the performance of the renderCUDA kernel. We introduce Balanced 3DGS, a Gaussian-wise parallel rendering approach with fine-grained tiling for the 3DGS training process that effectively resolves load-imbalance issues. First, we introduce an inter-block dynamic workload distribution technique that dynamically maps workloads to Streaming Multiprocessor (SM) resources within a single GPU, which constitutes the foundation of load balancing. Second, we are the first to propose a Gaussian-wise parallel rendering technique that significantly reduces workload divergence inside a warp, a critical component in addressing load imbalance. Building on these two methods, we further propose a fine-grained combined load balancing technique that distributes workload uniformly across all SMs, boosting forward renderCUDA kernel performance by up to 7.52×. Finally, we present a self-adaptive render kernel selection strategy for the 3DGS training process that adapts to different load-balance situations, effectively improving training efficiency.
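To build intuition for why dynamic workload mapping helps, the sketch below simulates the scheduling idea in Python (this is an illustrative model, not the paper's CUDA implementation; the tile costs and SM count are made-up parameters). A static round-robin tile-to-SM mapping suffers long-tail latency when a few tiles are far more expensive than the rest, while a pull-based dynamic mapping, modeled here as always assigning the next tile to the least-loaded SM, keeps all SMs busy and shrinks the makespan.

```python
import heapq

def static_makespan(tile_costs, num_sms):
    # Static round-robin mapping: tile i is always handled by SM i % num_sms,
    # regardless of how expensive the tile turns out to be.
    loads = [0] * num_sms
    for i, cost in enumerate(tile_costs):
        loads[i % num_sms] += cost
    return max(loads)

def dynamic_makespan(tile_costs, num_sms):
    # Dynamic mapping: an idle SM pulls the next tile from a shared queue,
    # modeled as assigning each tile to the currently least-loaded SM.
    loads = [0] * num_sms
    heapq.heapify(loads)
    for cost in tile_costs:
        heapq.heapreplace(loads, loads[0] + cost)
    return max(loads)

# Long-tail workload: every fourth tile is 50x more expensive than the rest.
costs = [50, 1, 1, 1] * 4

print(static_makespan(costs, 4))   # 200: one SM receives all four heavy tiles
print(dynamic_makespan(costs, 4))  # 56: heavy tiles spread across idle SMs
```

Under static mapping, the makespan is dominated by the single SM that happens to receive all the heavy tiles; dynamic mapping brings it close to the ideal average load (212 total cost / 4 SMs = 53), which mirrors the GPU-idleness and long-tail effects the paper targets.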