HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of reconstructing high-fidelity, animatable virtual humans from monocular smartphone scans and rendering them in real time on mobile devices, this paper proposes a lightweight yet expressive digital-human representation. The method jointly models static geometry and dynamic deformation by combining clothing-aware mesh extraction with illumination-aware differentiable Gaussians, explicitly capturing pose-dependent deformations and lighting variations. Building on monocular video-based 3D reconstruction, learned dynamic deformation modeling, and mesh-attached differentiable rendering, the authors further design a GPU-driven mobile rendering pipeline. The system achieves 120 FPS on smartphones and 90 FPS at 2K resolution on VR headsets, over 2.7× faster than representative mobile-engine baselines, while delivering superior visual fidelity and interactivity compared with existing monocular approaches.
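The paper's code is not part of this page, but the core "mesh-attached Gaussians" idea in the summary can be illustrated with a minimal NumPy sketch: each Gaussian's mean is stored in the local frame of a parent triangle, so re-evaluating the mapping after the mesh deforms carries the Gaussians along with the surface. All names here (`face_frames`, `gaussians_to_world`) are hypothetical, not from the paper.

```python
import numpy as np

def face_frames(verts, faces):
    """Build a local orthonormal frame (tangent, bitangent, normal)
    and an origin (centroid) for every mesh triangle."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    origin = (v0 + v1 + v2) / 3.0
    t = v1 - v0
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    n = np.cross(v1 - v0, v2 - v0)
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    b = np.cross(n, t)                   # unit by construction
    R = np.stack([t, b, n], axis=2)      # (F, 3, 3); columns are the axes
    return origin, R

def gaussians_to_world(local_mu, parent_face, verts, faces):
    """Map Gaussian means stored in per-face local coordinates into
    world space. Re-running this after the mesh deforms makes the
    Gaussians follow the surface."""
    origin, R = face_frames(verts, faces)
    o, Rf = origin[parent_face], R[parent_face]   # (G, 3), (G, 3, 3)
    return o + np.einsum('gij,gj->gi', Rf, local_mu)
```

Under this reading, re-posing the garment mesh (for example, via linear blend skinning) and calling `gaussians_to_world` again is what keeps the splats glued to the deforming clothing.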

📝 Abstract
We present HRM$^2$Avatar, a framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with smartphones provides a low-cost alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses challenges due to limited visual and geometric data. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. We extract garment meshes from monocular data to model clothing deformations effectively, and attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting. This representation efficiently learns high-resolution and dynamic information from monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for animating high-fidelity avatars in AR/VR, social gaming, and on-device creation. Our GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over $2.7\times$ faster than representative mobile-engine baselines. Experiments show that HRM$^2$Avatar delivers superior visual realism and real-time interactivity, outperforming state-of-the-art monocular methods.
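The abstract does not specify how the Gaussians are made illumination-aware. As one plausible reading only, the sketch below modulates each Gaussian's albedo with a simple Lambertian term computed from its parent-face normal; the paper's actual shading model presumably learns pose-dependent lighting from the dynamic motion sequences, and `shade_gaussians` and its parameters are hypothetical.

```python
import numpy as np

def shade_gaussians(albedo, normals, light_dir, ambient=0.2):
    """Hypothetical illumination-aware color: per-Gaussian albedo
    scaled by an ambient term plus a Lambertian factor from the
    normal of the Gaussian's parent mesh face."""
    l = np.asarray(light_dir, dtype=np.float64)
    l /= np.linalg.norm(l)
    lambert = np.clip(normals @ l, 0.0, None)            # (G,)
    return albedo * (ambient + (1.0 - ambient) * lambert)[:, None]
```

In the paper's setting, the fixed `light_dir` would be replaced by pose-conditioned quantities learned from the captured dynamic sequences, which is what lets re-posed avatars exhibit plausible lighting changes.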
Problem

Research questions and friction points this paper is trying to address.

Reconstructing high-fidelity avatars from monocular phone video sequences
Enabling real-time avatar animation on mobile and VR devices
Addressing limited visual and geometric data from single-view capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular phone scans create high-fidelity avatars
Garment meshes and illumination-aware Gaussians capture details
GPU-driven pipeline enables 120 FPS mobile rendering
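For context on the performance claims above, a short calculation converts the reported frame rates into per-frame time budgets; the stated 2.7× speedup implies a baseline of roughly 120 / 2.7 ≈ 44 FPS on the same phone hardware. This is arithmetic on the paper's reported numbers, not additional measurements.

```python
# Per-frame time budgets implied by the reported frame rates.
for label, fps in [("mobile (paper)", 120.0), ("VR @ 2K (paper)", 90.0),
                   ("implied baseline", 120.0 / 2.7)]:
    print(f"{label:18s} {fps:6.1f} FPS -> {1000.0 / fps:5.2f} ms/frame")
```

In other words, the GPU-driven pipeline must finish each frame in about 8.3 ms on phones and 11.1 ms on VR headsets, versus roughly 22.5 ms for the implied mobile-engine baseline.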
Authors

Chao Shi
Alibaba Group, China
Shenghao Jia
Shanghai Jiao Tong University, China and Alibaba Group, China
Jinhui Liu
Xi'an Jiaotong University
Yong Zhang
Alibaba Group, China
Liangchao Zhu
Alibaba Group, China
Zhonglei Yang
Alibaba Group, China
Jinze Ma
Alibaba Group, China
Chaoyue Niu
Shanghai Jiao Tong University
Device-Cloud ML · On-Device Intelligence
Chengfei Lv
Alibaba Group, China