HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of reconstructing high-fidelity, animatable virtual humans from monocular smartphone scans and rendering them in real time on mobile devices, this paper proposes a lightweight yet expressive digital-human representation. The method jointly models static geometry and dynamic deformation by combining clothing-aware mesh extraction with illumination-aware differentiable Gaussians, explicitly capturing pose-dependent deformations and lighting variations. Building on monocular video-based 3D reconstruction, learned dynamic deformation modeling, and mesh-attached differentiable rendering, the authors further design a GPU-driven mobile rendering pipeline. The system achieves 120 FPS on smartphones and 90 FPS at 2K resolution on VR headsets, over 2.7× faster than representative mobile-engine baselines, while delivering superior visual fidelity and interactivity compared with existing monocular approaches.
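The paper's code is not part of this page, but the core "mesh-attached Gaussians" idea in the summary can be illustrated with a minimal NumPy sketch: each Gaussian's mean is stored in the local frame of a parent triangle, so re-evaluating the mapping after the mesh deforms carries the Gaussians along with the surface. All names here (`face_frames`, `gaussians_to_world`) are hypothetical, not from the paper.

```python
import numpy as np

def face_frames(verts, faces):
    """Build a local orthonormal frame (tangent, bitangent, normal)
    and an origin (centroid) for every mesh triangle."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    origin = (v0 + v1 + v2) / 3.0
    t = v1 - v0
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    n = np.cross(v1 - v0, v2 - v0)
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    b = np.cross(n, t)                   # unit by construction
    R = np.stack([t, b, n], axis=2)      # (F, 3, 3); columns are the axes
    return origin, R

def gaussians_to_world(local_mu, parent_face, verts, faces):
    """Map Gaussian means stored in per-face local coordinates into
    world space. Re-running this after the mesh deforms makes the
    Gaussians follow the surface."""
    origin, R = face_frames(verts, faces)
    o, Rf = origin[parent_face], R[parent_face]   # (G, 3), (G, 3, 3)
    return o + np.einsum('gij,gj->gi', Rf, local_mu)
```

Under this reading, re-posing the garment mesh (for example, via linear blend skinning) and calling `gaussians_to_world` again is what keeps the splats glued to the deforming clothing.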

📝 Abstract
We present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.
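The abstract does not specify how the Gaussians are made illumination-aware. As one plausible reading only, the sketch below modulates each Gaussian's albedo with a simple Lambertian term computed from its parent-face normal; the paper's actual shading model presumably learns pose-dependent lighting from the dynamic motion sequences, and `shade_gaussians` and its parameters are hypothetical.

```python
import numpy as np

def shade_gaussians(albedo, normals, light_dir, ambient=0.2):
    """Hypothetical illumination-aware color: per-Gaussian albedo
    scaled by an ambient term plus a Lambertian factor from the
    normal of the Gaussian's parent mesh face."""
    l = np.asarray(light_dir, dtype=np.float64)
    l /= np.linalg.norm(l)
    lambert = np.clip(normals @ l, 0.0, None)            # (G,)
    return albedo * (ambient + (1.0 - ambient) * lambert)[:, None]
```

In the paper's setting, the fixed `light_dir` would be replaced by pose-conditioned quantities learned from the captured dynamic sequences, which is what lets re-posed avatars exhibit plausible lighting changes.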
Problem

Research questions and friction points this paper is trying to address.

Reconstructing high-fidelity avatars from monocular phone video sequences
Enabling real-time avatar animation on mobile and VR devices
Addressing limited visual and geometric data from single-view capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular phone scans create high-fidelity avatars
Garment meshes and illumination-aware Gaussians capture details
GPU-driven pipeline enables 120 FPS mobile rendering
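For context on the performance claims above, a short calculation converts the reported frame rates into per-frame time budgets; the stated 2.7× speedup implies a baseline of roughly 120 / 2.7 ≈ 44 FPS on the same phone hardware. This is arithmetic on the paper's reported numbers, not additional measurements.

```python
# Per-frame time budgets implied by the reported frame rates.
for label, fps in [("mobile (paper)", 120.0), ("VR @ 2K (paper)", 90.0),
                   ("implied baseline", 120.0 / 2.7)]:
    print(f"{label:18s} {fps:6.1f} FPS -> {1000.0 / fps:5.2f} ms/frame")
```

In other words, the GPU-driven pipeline must finish each frame in about 8.3 ms on phones and 11.1 ms on VR headsets, versus roughly 22.5 ms for the implied mobile-engine baseline.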
Authors

Chao Shi
Alibaba Group, China
Shenghao Jia
Shanghai Jiao Tong University, China and Alibaba Group, China
Jinhui Liu
Xi'an Jiaotong University
Yong Zhang
Alibaba Group, China
Liangchao Zhu
Alibaba Group, China
Zhonglei Yang
Alibaba Group, China
Jinze Ma
Alibaba Group, China
Chaoyue Niu
Shanghai Jiao Tong University
Device-Cloud ML · On-Device Intelligence
Chengfei Lv
Alibaba Group, China