🤖 AI Summary
Addressing surface material reconstruction and classification under extremely sparse visual cues, this paper proposes SMARC, a unified framework that simultaneously reconstructs the RGB surface and recognizes the material category from a single contiguous patch covering only 10% of the image. The method couples a partial-convolution U-Net backbone with a lightweight classification head, enabling end-to-end joint optimization of spatial inpainting and semantic understanding. Evaluated on the real-world Touch and Go texture dataset, SMARC achieves a PSNR of 17.55 dB and a classification accuracy of 85.10%, outperforming five state-of-the-art baselines, including ViT, MAE, and Swin Transformer. The result is an efficient single-stage approach to material understanding under severely occluded or restricted-view conditions, with direct implications for robotic perception and simulation-based interactive systems.
📝 Abstract
Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compare SMARC against five models, a convolutional autoencoder [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2], on the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution for spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
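The key operation behind the abstract's claim is partial convolution: each output is computed only from observed pixels, renormalized by how much of the window was visible, and the visibility mask is progressively dilated so information flows outward from the 10% observed patch. The following is a minimal single-channel NumPy sketch of that idea for illustration only; the function name and pure-Python formulation are ours, not the paper's implementation, which stacks such layers inside a U-Net.

```python
# Illustrative sketch of one partial-convolution step (hypothetical helper,
# not SMARC's actual code): convolve only over observed pixels, renormalize
# by the observed fraction, and update the mask where anything was seen.
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Single-channel partial convolution (stride 1, 'valid' padding).

    x      : (H, W) image with unobserved pixels zeroed
    mask   : (H, W) binary mask, 1 = observed
    kernel : (k, k) convolution weights
    Returns (output, updated_mask).
    """
    k = kernel.shape[0]
    H, W = x.shape
    out = np.zeros((H - k + 1, W - k + 1))
    new_mask = np.zeros_like(out)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                win = x[i:i + k, j:j + k] * m
                # Renormalize by (window size / observed count) so results
                # are comparable regardless of how sparse the window is.
                out[i, j] = (win * kernel).sum() * (k * k / valid)
                # Mask dilation: this location is now "observed" downstream.
                new_mask[i, j] = 1.0
    return out, new_mask
```

On a constant image, the renormalization makes every window that contains at least one observed pixel reproduce the same response as a fully observed window, which is exactly why stacked partial convolutions can propagate plausible values into the 90% of the image that was never seen.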