Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech-driven 3D facial animation models rely on large-scale, high-quality audio-animation paired data and suffer from high computational complexity and inference latency, hindering real-time deployment on resource-constrained edge devices (e.g., gaming consoles). To address this, we propose a lightweight knowledge distillation framework: a minimal student model built exclusively from convolutional and fully connected layers (no attention or recurrent components), trained via hybrid knowledge distillation on teacher-generated pseudo-labels, which obviates the need for ground-truth animation annotations. The resulting model occupies only 3.4 MB and requires merely 81 ms of future audio context for inference, significantly less than prior methods, while preserving high-fidelity animation quality. To our knowledge, this is the first approach enabling real-time, on-device speech-driven facial animation under such stringent resource constraints.
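The summary does not specify the architecture, so here is a minimal sketch, assuming a PyTorch student that maps frames of audio features directly to per-frame animation parameters using only convolutional and fully-connected layers. All dimensions (80 mel bins in, 52 blendshape weights out, channel widths, kernel sizes) are illustrative assumptions, not the authors' configuration; the point is the absence of attention and recurrence, which keeps both footprint and lookahead small.

```python
import torch
import torch.nn as nn

class StudentFaceAnimator(nn.Module):
    """Hypothetical conv + fully-connected student: no attention, no recurrence."""

    def __init__(self, n_mels: int = 80, n_blendshapes: int = 52):
        super().__init__()
        # Small temporal convolutions: the receptive field spans only a few
        # frames, so only a short window of future audio is ever required.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Per-frame fully-connected head mapping features to blendshape weights.
        self.head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_blendshapes),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, frames, n_blendshapes)
        h = self.conv(mel)
        return self.head(h.transpose(1, 2))

model = StudentFaceAnimator()
out = model(torch.randn(1, 80, 100))  # 100 audio frames
print(out.shape)                      # torch.Size([1, 100, 52])
```

With no attention context or recurrent state, inference reduces to a fixed-cost sliding window over the audio stream, which is what makes the small future-context budget (81 ms in the paper) achievable.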

📝 Abstract
The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and therefore enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models consist only of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint to as little as 3.4 MB and the required future audio context to as little as 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.
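The abstract names the training scheme (hybrid knowledge distillation with pseudo-labeling) but not its exact objective. A common formulation, sketched below purely as an assumption, combines an output-level loss against teacher-generated pseudo-labels with a feature-matching term on intermediate activations; `teacher`, `student`, the `proj` adapter, and the loss weighting are all hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_step(student, teacher, proj, audio_feats, optimizer,
                             feat_weight: float = 0.5) -> float:
    """One training step on unlabeled audio (assumed interfaces).

    Both models are assumed to return (outputs, intermediate_features);
    `proj` maps student features into the teacher's feature space, and its
    parameters are assumed to be registered with `optimizer`.
    """
    teacher.eval()
    with torch.no_grad():
        # The teacher turns raw audio features into pseudo-labels,
        # so no ground-truth animation annotations are needed.
        pseudo_labels, teacher_feats = teacher(audio_feats)

    student_out, student_feats = student(audio_feats)

    # Output-level distillation against the teacher's pseudo-labels.
    label_loss = F.mse_loss(student_out, pseudo_labels)
    # Feature-level distillation (the "hybrid" part of this sketch).
    feat_loss = F.mse_loss(proj(student_feats), teacher_feats)

    loss = label_loss + feat_weight * feat_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher supplies all supervision, any large unlabeled audio corpus becomes training data for the student, which is how the method sidesteps the scarcity of audio-animation pairs.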
Problem

Research questions and friction points this paper is trying to address.

Lack of large datasets for high-quality facial animation models
Large pre-trained models are not suitable for real-time on-device use
Need for small, efficient models maintaining high-quality animation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid knowledge distillation with pseudo-labeling
Small convolutional and fully-connected student models
Reduced memory footprint and required future audio context (see the footprint check below)
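As a rough, back-of-the-envelope check (our arithmetic, not a figure from the paper): a 3.4 MB budget holds on the order of 0.9 M parameters in float32, or roughly 1.8 M in float16, which is consistent with a small convolutional and fully-connected network.

```python
# Illustrative parameter budget implied by a 3.4 MB model file;
# the choice of precisions is an assumption, not stated in the paper.
budget_bytes = 3.4 * 1024 * 1024
for precision, bytes_per_param in (("fp32", 4), ("fp16", 2)):
    print(f"{precision}: ~{budget_bytes / bytes_per_param / 1e6:.2f}M parameters")
```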
Zhen Han
Electronic Arts, Sweden
Mattias Teye
Electronic Arts, Sweden
Derek Yadgaroff
Electronic Arts, Sweden
Judith Bütepage
Senior Research Engineer, SEED @ EA
Machine Learning · Artificial Intelligence