🤖 AI Summary
This work addresses intrinsic image decomposition for an arbitrary number of input views under varying lighting conditions. We propose the first end-to-end, multi-view consistent framework for joint geometry and material estimation. Methodologically, we design a cross-view, cross-domain attention mechanism that integrates diffusion-based priors with physically grounded rendering constraints; introduce illumination-augmentation and view-adaptive training strategies; and construct ARB-Objaverse, the first large-scale multi-view, multi-illumination intrinsic dataset. Experiments demonstrate significant improvements over state-of-the-art methods in surface normal and material property estimation, both qualitatively and quantitatively. Our approach further benefits downstream tasks including single-image relighting, photometric stereo, and 3D reconstruction, and effectively mitigates the longstanding challenges of illumination-material ambiguity and multi-view inconsistency.
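The cross-view, cross-domain attention described above can be pictured as two interleaved attention passes over a joint token grid: one attending across all views within each intrinsic domain (to enforce multi-view consistency), and one attending across all domains within each view (to couple geometry and material predictions). The sketch below is a minimal, hypothetical illustration of that token-routing idea using plain scaled dot-product attention in NumPy; the actual IDArb module operates inside a diffusion U-Net with learned projections, which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the token axis.
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return w @ v

def cross_view_cross_domain(tokens):
    """tokens: (n_views, n_domains, n_tokens, dim).

    Cross-view pass: for each domain, flatten views into one token
    sequence so every view attends to every other view.
    Cross-domain pass: for each view, flatten domains into one token
    sequence so normal/albedo/material tokens attend to each other.
    (Self-attention with identical q, k, v; no learned weights.)
    """
    n_views, n_domains, n_tok, dim = tokens.shape
    # Cross-view: group by domain, merge view and token axes.
    x = tokens.transpose(1, 0, 2, 3).reshape(n_domains, n_views * n_tok, dim)
    x = attention(x, x, x)
    x = x.reshape(n_domains, n_views, n_tok, dim).transpose(1, 0, 2, 3)
    # Cross-domain: group by view, merge domain and token axes.
    y = x.reshape(n_views, n_domains * n_tok, dim)
    y = attention(y, y, y)
    return y.reshape(n_views, n_domains, n_tok, dim)

rng = np.random.default_rng(0)
# 4 input views, 3 intrinsic domains (e.g. normal / albedo / roughness-metallic),
# 8 tokens per image, 16-dim features -- all illustrative sizes.
out = cross_view_cross_domain(rng.normal(size=(4, 3, 8, 16)))
print(out.shape)  # (4, 3, 8, 16)
```

Because both passes share the token sequence, the module scales to an arbitrary number of views at inference time, which is the property the paper emphasizes.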
📝 Abstract
Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computation to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with the inherent ambiguity between lighting and material. Learning-based approaches, on the other hand, leverage rich material priors from existing 3D object datasets but struggle to maintain multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation of surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applicability in realistic 3D content creation.