🤖 AI Summary
Existing multi-view 3D face reconstruction methods optimize each vertex independently, which leads to high memory consumption, poor scalability to dense topologies, and surface noise. This work proposes SHELLS, a framework that introduces coarse-mesh-guided hierarchical surface sampling to this task for the first time. Using a DINOv2 backbone with LoRA adapters, SHELLS extracts a sparse global feature cloud, predicts a coarse mesh, and builds surface-oriented sampling shells around that mesh, which decouples feature extraction from mesh resolution. The model is trained exclusively on synthetic data yet generalizes to real-world captures, so it does not require costly pre-registered multi-view datasets. Compared to volumetric baselines, SHELLS reduces inference GPU memory usage by 88% (2.4 GB vs. 20 GB), achieves a 3.5× speedup for 18k-vertex meshes (0.08 s vs. 0.29 s), and lowers median registration error by 21% to 29%.
📝 Abstract
We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. Existing methods typically refine vertices independently via localized feature volumes. This approach couples memory-intensive feature sampling to mesh resolution, which limits scalability for dense topologies (>10k vertices) and introduces surface noise. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DINOv2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. SHELLS maintains surface consistency while using 88% less inference GPU memory (2.4 GB vs. 20 GB) than volumetric baselines. It reduces median registration error by 21% to 29% with a 3.5× inference speedup (0.08 s vs. 0.29 s) for 18k-vertex meshes. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.
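The abstract does not say how the sampling shells are built or how a final vertex is chosen from them, so the following is only an illustrative sketch of one plausible reading. It assumes each shell layer is the coarse mesh offset along its vertex normals, and that the final position is a softmax-weighted average over the candidates for each vertex. The names `build_shells` and `soft_select`, the `num_layers` and `half_width` parameters, and the per-candidate scores are all hypothetical and do not come from the paper.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def build_shells(vertices, normals, num_layers=5, half_width=0.01):
    """Offset each coarse vertex along its normal to get layered candidates.

    Assumption (not from the paper): layers are evenly spaced in
    [-half_width, +half_width] along the unit vertex normal.
    Returns (shells, offsets) where shells[i][k] is the k-th candidate
    position for vertex i.
    """
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2")
    step = 2.0 * half_width / (num_layers - 1)
    offsets = [-half_width + step * k for k in range(num_layers)]
    shells = []
    for v, n in zip(vertices, normals):
        unit_n = normalize(n)
        shells.append(
            [tuple(vc + d * nc for vc, nc in zip(v, unit_n)) for d in offsets]
        )
    return shells, offsets


def soft_select(candidates, scores):
    """Softmax-weighted average of candidate positions.

    Assumption (not from the paper): a network would supply one score per
    candidate; here the scores are simply passed in.
    """
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return tuple(
        sum(w * c[axis] for w, c in zip(weights, candidates)) / total
        for axis in range(3)
    )


# Toy usage: one coarse vertex at the origin with a normal along +z.
shells, offsets = build_shells([(0.0, 0.0, 0.0)], [(0.0, 0.0, 2.0)],
                               num_layers=3, half_width=0.1)
refined = soft_select(shells[0], [0.0, 0.0, 100.0])
```

In this sketch the number of candidates grows as vertices × layers rather than with the size of a dense feature volume, which is one way to picture the decoupling of feature sampling from mesh resolution that the abstract describes.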