MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of fine-tuning pre-trained Vision Transformers (ViTs) for 3D Gaussian Splatting rendering under sparse-view settings, this paper proposes a lightweight multi-scale adaptation framework. Methodologically, it introduces low-parameter multi-scale adapters and a cross-view geometric-consistency-aware Feature Fusion Aggregator, enabling efficient fine-tuning while keeping the ViT backbone frozen, without requiring pose estimation or explicit camera calibration. The key contributions are: (i) a drastic reduction in trainable parameters (≤1% of ViT parameters) and GPU memory consumption while preserving high-fidelity novel view synthesis; and (ii) state-of-the-art rendering accuracy on multiple benchmark datasets, with 3–5× higher training efficiency than prevailing methods.
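To make the ≤1% figure concrete, the sketch below shows one plausible form of a low-parameter adapter: a residual bottleneck inserted into a frozen ViT block. This is an illustration under assumed dimensions (ViT-Base width 768, adapter rank 8), not the paper's actual architecture; the function names and parameter shapes are hypothetical.

```python
import numpy as np

def bottleneck_adapter(x, w_down, b_down, w_up, b_up):
    """Residual bottleneck adapter: x + up(relu(down(x))).

    Only the small down/up projections are trained; the surrounding
    ViT block stays frozen.
    """
    h = np.maximum(x @ w_down + b_down, 0.0)  # down-project to rank r
    return x + h @ w_up + b_up                # up-project back to width d

def adapter_param_count(d, r):
    # down: d*r weights + r biases; up: r*d weights + d biases
    return d * r + r + r * d + d

def vit_block_param_count(d, mlp_ratio=4):
    # QKV + output projection, MLP, and two LayerNorms (rough count)
    attn = 4 * d * d + 4 * d
    mlp = d * (mlp_ratio * d) + mlp_ratio * d + (mlp_ratio * d) * d + d
    ln = 2 * 2 * d
    return attn + mlp + ln

d, r = 768, 8  # assumed ViT-Base width and adapter bottleneck rank
ratio = adapter_param_count(d, r) / vit_block_param_count(d)
print(f"trainable fraction per block: {ratio:.4f}")  # well below 0.01
```

With these assumed sizes the adapter adds roughly 13K trainable parameters against ~7M frozen ones per block, which is consistent in spirit with the summary's ≤1% claim.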

📝 Abstract
Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models with little compromise in rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters. This design avoids the prohibitive GPU overhead of previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views while significantly reducing memory usage, training complexity, and computational cost. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality with significantly fewer trainable parameters and lower training resource requirements than existing methods.
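The abstract contrasts the Feature Fusion Aggregator with per-view memory banks: all views are fused into one geometry-consistent feature map in a single pass. One plausible realization, sketched here purely as an assumption (the paper's exact mechanism is not specified on this page), is attention-weighted pooling across views at each token position:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(feats, w_score):
    """Fuse per-view token features (V, N, d) into one (N, d) map.

    Each token position attends over the V views with a learned scalar
    score, so every view contributes to a single fused feature rather
    than being queued separately in a memory bank.
    """
    scores = feats @ w_score              # (V, N, 1): score per view/token
    weights = softmax(scores, axis=0)     # normalize across the view axis
    return (weights * feats).sum(axis=0)  # (N, d) fused features

# Toy sizes: 3 input views, 16 tokens per view, 32-dim features.
rng = np.random.default_rng(0)
V, N, d = 3, 16, 32
feats = rng.normal(size=(V, N, d))
w_score = rng.normal(size=(d, 1))
fused = fuse_views(feats, w_score)
print(fused.shape)  # (16, 32)
```

Memory here stays constant in the number of stored views after fusion, which matches the abstract's claim of lower memory usage relative to memory-bank designs.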
Problem

Research questions and friction points this paper is trying to address.

Reducing the computational cost of training sparse-view 3D Gaussian splatting models.
Enabling efficient ViT fine-tuning through a lightweight multi-scale adapter design.
Improving feature integration across input views while lowering memory usage.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Multi-Scale Adapter for efficient ViT fine-tuning
Feature Fusion Aggregator for consistent geometric integration
Reduced parameters and GPU costs while maintaining high quality
Muyu Xu
Nanyang Technological University, Singapore
Fangneng Zhan
MIT
Neural Rendering, Generative Models
Xiaoqin Zhang
Zhejiang University of Technology, China
Ling Shao
UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China
Shijian Lu
College of Computing and Data Science, NTU
Image and video analytics, computer vision, machine learning