From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how the choice of speech representation affects speech-driven 3D facial animation. The authors evaluate four families of speech representations, including self-supervised (SSL) features, neural codec latents, and ASR-style label-based spaces, comparing facial reconstruction quality across two facial decoders with objective metrics and a perceptual evaluation. Probing analyses relate the tokenized representations to phonetic units and to articulatory deformations. The findings indicate that encoding phonetic classes benefits facial animation prediction, and that semantic and label-based representations reach comparable animation quality. Building on these results, the paper introduces an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space for decoding both speech and 3D facial motion.

📝 Abstract

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.
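
The abstract does not say how the probing analyses are computed. As an illustration only, the sketch below shows one common way to relate discrete tokens to phonetic units: given frame-aligned token IDs and phoneme labels, it measures how well each token predicts a phoneme. The function names (`phone_purity`, `phoneme_uncertainty_explained`), the toy inputs, and the choice of metrics are assumptions for this example, not the authors' protocol.

```python
from collections import Counter
from math import log


def phone_purity(tokens, phones):
    """Fraction of frames whose phoneme is the majority phoneme of their token.

    Hypothetical helper: `tokens` and `phones` are frame-aligned sequences.
    """
    if len(tokens) != len(phones) or len(tokens) == 0:
        raise ValueError("tokens and phones must be non-empty and equally long")
    per_token = {}
    for tok, ph in zip(tokens, phones):
        per_token.setdefault(tok, Counter())[ph] += 1
    # For each token, count the frames carrying its most frequent phoneme.
    majority_frames = sum(c.most_common(1)[0][1] for c in per_token.values())
    return majority_frames / len(tokens)


def phoneme_uncertainty_explained(tokens, phones):
    """Mutual information I(token; phoneme) divided by the phoneme entropy.

    Returns a value in [0, 1]: 1 means tokens determine the phoneme,
    0 means tokens carry no phoneme information.
    """
    if len(tokens) != len(phones) or len(tokens) == 0:
        raise ValueError("tokens and phones must be non-empty and equally long")
    n = len(tokens)
    joint = Counter(zip(tokens, phones))
    tok_counts = Counter(tokens)
    ph_counts = Counter(phones)
    mi = sum(
        (c / n) * log(c * n / (tok_counts[t] * ph_counts[p]))
        for (t, p), c in joint.items()
    )
    h_ph = -sum((c / n) * log(c / n) for c in ph_counts.values())
    return mi / h_ph if h_ph > 0 else 0.0


if __name__ == "__main__":
    # Toy frame-level alignment (made-up data, for illustration only).
    toy_tokens = [7, 7, 7, 12, 12, 3, 3, 3]
    toy_phones = ["AA", "AA", "AA", "B", "B", "S", "S", "AA"]
    print("purity:", phone_purity(toy_tokens, toy_phones))
    print("explained:", phoneme_uncertainty_explained(toy_tokens, toy_phones))
```

Under this kind of probe, a representation that encodes phonetic classes would score high on both measures, while a representation tuned for acoustic reconstruction might spread each phoneme over many tokens and score lower.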

Problem

Research questions and friction points this paper is trying to address.

speech representation

3D facial animation

discrete representations

facial synthesis

phonetic units

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete speech representations

3D facial animation

phonetic units