🤖 AI Summary
To address the high computational overhead and the difficulty of balancing robustness and efficiency for on-device, real-time, high-fidelity Codec Avatar driving in AR/VR, this paper proposes AVE-NAS, a neural architecture search framework, together with LATEX, a temporal redundancy-aware frame-skipping mechanism. The work is, to the authors' knowledge, the first to exploit the linearity of the decoder latent space to enable adaptive implicit extrapolation, enhancing robustness under extreme facial expressions while improving hardware compatibility. The method achieves a 5.05× on-device inference speedup on Meta Quest 2, with animation quality matching or surpassing state-of-the-art methods, entirely without cloud assistance or dedicated co-processors. This establishes a deployable paradigm for lightweight, high-fidelity, real-time virtual-human interaction.
📝 Abstract
Real-time and robust photorealistic avatars have been highly desired for enabling immersive telepresence in AR/VR. However, one key bottleneck remains: the considerable computational expense needed to accurately infer facial expressions captured from headset-mounted cameras at a quality level that can match the realism of the avatar's human appearance. To this end, we propose a framework called Auto-CARD, which for the first time enables real-time and robust driving of Codec Avatars using only on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique called AVE-NAS for avatar encoding in AR/VR, which explicitly boosts both the searched architectures' robustness in the presence of extreme facial expressions and their hardware friendliness on fast-evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism dubbed LATEX to skip the computation of redundant frames. Specifically, we first identify an opportunity arising from the linearity of the latent space derived by the avatar decoder and then propose to perform adaptive latent extrapolation for redundant frames. For evaluation, we demonstrate the efficacy of our Auto-CARD framework in real-time Codec Avatar driving settings, where we achieve a $5.05\times$ speedup on Meta Quest 2 while maintaining comparable or even better animation quality than state-of-the-art avatar encoder designs.
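The frame-skipping idea behind LATEX can be illustrated with a minimal sketch: because the decoder latent space is approximately linear over short time windows, a temporally redundant frame's latent code can be extrapolated from the previous two latents instead of running the full encoder. The sketch below is an illustrative approximation, not the paper's implementation; the `drive_avatar` function, the `encoder` callable, and the velocity-norm threshold used to decide when a frame is redundant are all hypothetical stand-ins.

```python
import numpy as np

def drive_avatar(frames, encoder, delta_thresh=0.1):
    """Hypothetical sketch of latent-extrapolation frame skipping.

    When the latent "velocity" (difference of the last two latents)
    is small, the next latent is linearly extrapolated instead of
    invoking the full encoder, saving on-device compute.
    """
    latents = []
    for t, frame in enumerate(frames):
        if t >= 2:
            velocity = latents[-1] - latents[-2]
            if np.linalg.norm(velocity) < delta_thresh:
                # Temporally redundant frame: extrapolate in latent space.
                latents.append(latents[-1] + velocity)
                continue
        # Non-redundant frame (or warm-up): run the full encoder.
        latents.append(encoder(frame))
    return latents
```

In the actual system, the skip decision would be adaptive (e.g., driven by a learned or calibrated criterion) rather than a fixed norm threshold; the point of the sketch is only that extrapolation in a linear latent space is a cheap surrogate for encoder inference.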