🤖 AI Summary
This study investigates how the choice of speech representation affects speech-driven 3D facial animation. The authors evaluate four families of speech representations, including self-supervised features, neural codec latents, and ASR-derived label spaces, comparing facial reconstruction quality across two facial decoders with both objective metrics and a perceptual evaluation. Probing analyses relate the tokenized representations to phonetic units and to articulatory deformations. The results indicate that encoding phonetic classes benefits facial animation prediction, and that semantic and label-based representations reach comparable animation quality. Building on these results, the paper introduces an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.
📝 Abstract
The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We find that encoding phonetic classes benefits facial animation prediction, and that semantic and label-based representations achieve comparable facial animation quality. Building on these results, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space from which both speech and 3D facial motion are decoded.
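To make the idea of probing tokenized representations against phonetic units concrete, here is a minimal sketch of one common probe, phone purity. It is not the authors' procedure, which the abstract does not specify. It assumes frame-aligned sequences of discrete token IDs and phoneme labels; the function name `token_phone_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def token_phone_purity(token_ids, phone_labels):
    """Return the fraction of frames explained by each token's majority phoneme.

    A value near 1.0 means each discrete token maps almost exclusively to one
    phoneme; lower values mean tokens are shared across phonemes.
    """
    if len(token_ids) != len(phone_labels):
        raise ValueError("token_ids and phone_labels must be frame-aligned")
    if not token_ids:
        raise ValueError("inputs must not be empty")

    # Count how often each phoneme co-occurs with each token.
    counts = defaultdict(Counter)
    for token, phone in zip(token_ids, phone_labels):
        counts[token][phone] += 1

    # For every token, keep only the frames of its most frequent phoneme.
    majority_frames = sum(c.most_common(1)[0][1] for c in counts.values())
    return majority_frames / len(token_ids)


# Toy, hand-made alignment (assumption: one token and one phoneme per frame).
tokens = [3, 3, 7, 7, 7, 1]
phones = ["AA", "AA", "B", "B", "AA", "S"]
purity = token_phone_purity(tokens, phones)
```

In this toy alignment, token 7 is split between two phonemes, so the purity falls below 1.0. A representation that encodes phonetic classes more explicitly would be expected to score higher on a probe of this kind.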