Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multi-view 3D head reconstruction methods refine each vertex independently through localized feature volumes, which drives up memory consumption, limits scalability to dense topologies, and introduces surface noise. This work proposes SHELLS, a feed-forward framework built on coarse-mesh-guided hierarchical surface sampling. Using a DINOv2 backbone with LoRA adapters, SHELLS extracts a sparse global feature cloud, predicts a coarse mesh, and constructs layered, surface-oriented sampling shells around that mesh, which decouples feature extraction from mesh resolution. Trained exclusively on synthetic data, the method generalizes to real-world captures without requiring costly pre-registered multi-view datasets. Compared to volumetric baselines, SHELLS reduces inference GPU memory usage by 88% (2.4 GB vs. 20 GB), achieves a 3.5× inference speedup for 18k-vertex meshes (0.08 s vs. 0.29 s), and lowers median registration error by 21% to 29%.
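As a quick sanity check on the reported figures, the sketch below recomputes the memory reduction and speedup from the rounded numbers quoted above. The function names are illustrative, not from the paper. The ratio of the rounded timings comes out near 3.6, close to the stated 3.5×; the small gap is presumably due to rounding in the published timings.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline saved, e.g. 0.88 means 88% less."""
    return 1.0 - ours / baseline


def speedup(baseline: float, ours: float) -> float:
    """How many times faster 'ours' is than the baseline."""
    return baseline / ours


# Figures quoted in the summary (rounded as published).
memory_saving = relative_reduction(baseline=20.0, ours=2.4)  # GB
time_ratio = speedup(baseline=0.29, ours=0.08)               # seconds

print(f"memory reduction: {memory_saving:.0%}")  # memory reduction: 88%
print(f"speedup: {time_ratio:.1f}x")             # speedup: 3.6x
```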

📝 Abstract

We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. Existing methods typically refine vertices independently via localized feature volumes. This approach couples memory-intensive feature sampling to mesh resolution, which limits scalability for dense topologies (> 10k vertices) and introduces surface noise. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DINOv2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. SHELLS maintains surface consistency while using 88% less inference GPU memory (2.4 GB vs. 20 GB) than volumetric baselines. It reduces median registration error by 21% to 29% with a 3.5× inference speedup (0.08 s vs. 0.29 s) for 18k-vertex meshes. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.
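The layered shells described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: shells are taken to be copies of the coarse vertices offset along their normals at evenly spaced signed distances, and the final position is picked with a softmax-weighted offset over per-layer scores. The names `layer_offsets`, `build_shells`, `expected_offset`, `num_layers`, and `half_width` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def layer_offsets(num_layers: int, half_width: float) -> List[float]:
    """Evenly spaced signed offsets in [-half_width, +half_width] (assumed layout)."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [0.0]
    step = 2.0 * half_width / (num_layers - 1)
    return [-half_width + k * step for k in range(num_layers)]


def build_shells(
    vertices: Sequence[Vec3],
    normals: Sequence[Vec3],
    num_layers: int = 5,
    half_width: float = 0.01,
) -> List[List[Vec3]]:
    """Return shells[k][i]: coarse vertex i moved along its normal to layer k."""
    if len(vertices) != len(normals):
        raise ValueError("vertices and normals must have the same length")
    shells: List[List[Vec3]] = []
    for d in layer_offsets(num_layers, half_width):
        layer: List[Vec3] = []
        for (vx, vy, vz), (nx, ny, nz) in zip(vertices, normals):
            length = math.sqrt(nx * nx + ny * ny + nz * nz)
            if length == 0.0:
                raise ValueError("zero-length normal")
            s = d / length  # normalize the normal and scale by the offset
            layer.append((vx + s * nx, vy + s * ny, vz + s * nz))
        shells.append(layer)
    return shells


def expected_offset(offsets: Sequence[float], scores: Sequence[float]) -> float:
    """Softmax-weighted offset over the layers for one vertex (hypothetical selection rule)."""
    if len(offsets) != len(scores) or not offsets:
        raise ValueError("offsets and scores must be non-empty and equal length")
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, offsets)) / total
```

In this sketch the number of sample points grows with the coarse vertex count times the number of layers, rather than with a dense volume around every fine vertex, which is one way to read the claim that sampling is decoupled from mesh resolution.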

Problem

Research questions and friction points this paper is trying to address.

3D head reconstruction

multi-view images

dense topology

surface consistency

semantic correspondence

Innovation

Methods, ideas, or system contributions that make the work stand out.

layered surface sampling

topologically consistent reconstruction

coarse-guided sampling