HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

📅 2026-06-01

🤖 AI Summary

This work addresses single-image 3D human reconstruction, which is limited by the scarcity of diverse, high-quality 3D training data. The authors build a scalable data generation pipeline that combines animation of existing rigged assets with fitting on multi-camera human captures, scaling the training set to 100k assets. They also introduce a feed-forward, token-conditioned architecture: the input image and an estimated SMPL mesh are encoded into compact tokens, which are fused through cross-attention to construct a triplane-based 3D avatar. Inference takes less than one second and requires no test-time optimization. The authors report quantitative and qualitative gains on multiple benchmarks, along with robustness to diverse input images.

📝 Abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at https://HumanNOVA.github.io .
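The token-conditioning step described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single attention head with no learned projections, random vectors standing in for the encoded image and SMPL tokens, and a tiny 2x2 plane resolution. All names (`cross_attention`, `random_tokens`, `DIM`, `RES`) are hypothetical and chosen only for illustration. It shows how a set of plane tokens could attend to the concatenated conditioning tokens and then be reshaped into three feature planes.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head cross-attention without learned projections (an assumption).

    queries: list of query vectors (here, triplane tokens), each of length d
    context: list of conditioning vectors (image + SMPL tokens), each of length d
    Returns the residual-updated queries and the attention weights.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        w = softmax(scores)
        # Weighted sum of context tokens (context doubles as keys and values).
        mixed = [sum(w[j] * context[j][c] for j in range(len(context)))
                 for c in range(d)]
        outputs.append([qi + mi for qi, mi in zip(q, mixed)])  # residual add
        weights.append(w)
    return outputs, weights


def random_tokens(rng, n, d):
    """Stand-in for an encoder: n random tokens of dimension d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
DIM = 8   # token dimension (hypothetical)
RES = 2   # per-plane resolution (hypothetical, tiny for illustration)

image_tokens = random_tokens(rng, 6, DIM)   # placeholder for image encoding
smpl_tokens = random_tokens(rng, 4, DIM)    # placeholder for SMPL encoding
plane_tokens = random_tokens(rng, 3 * RES * RES, DIM)  # one token per plane cell

# Conditioning signal: image and SMPL tokens concatenated along the token axis.
conditioning = image_tokens + smpl_tokens
fused, attn = cross_attention(plane_tokens, conditioning)

# Reshape the flat token list into three planes of shape RES x RES x DIM.
triplane = [
    [[fused[p * RES * RES + y * RES + x] for x in range(RES)]
     for y in range(RES)]
    for p in range(3)
]
```

In a real model the queries, keys, and values would pass through learned linear projections and multiple attention layers, and the resulting planes would be decoded into geometry and appearance; none of those details are specified in the abstract, so they are omitted here.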

Problem

Research questions and friction points this paper is trying to address.

3D human avatar

photorealistic modeling

single-image reconstruction

data scarcity

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

photorealistic 3D avatar

single-image reconstruction

token-conditioned modeling