ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

📅 2026-06-11

🤖 AI Summary

Existing speech-driven portrait animation methods struggle to combine high lip-sync accuracy with natural facial expressions and head movements. To address this, the work proposes ReFree-S2V, a framework built on a pretrained flow-matching video generation model. It integrates multi-granularity speech representations, covering both phonetic and prosodic features, and injects them through a learnable hierarchical guidance mechanism. It also incorporates a reward-free reinforcement learning strategy that discourages perceptually implausible motion without relying on handcrafted synchronization metrics, reward models, or human preference annotations. The reported experiments show the method outperforming current state-of-the-art approaches in both objective lip-sync accuracy and subjective evaluations of naturalness and expressiveness.

📝 Abstract

Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversational behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to capture both fine-grained speech articulation and high-level expressive cues. The model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics, reward models, or costly human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.
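
The abstract does not say how the learnable level selectors are implemented. One plausible reading is a per-block softmax gate that mixes the speech feature levels before they reach a transformer block. The sketch below illustrates that reading only; the class name `LevelSelector`, the softmax gating, and the toy feature vectors are assumptions for illustration, not the authors' design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LevelSelector:
    """Hypothetical gate over speech feature levels for one transformer block.

    Assumption: each block owns one logit per level (e.g. phonetic, prosodic).
    In training these logits would be learned; here they are plain floats.
    """

    def __init__(self, num_levels):
        # Zero logits give a uniform mix of all levels at initialization.
        self.logits = [0.0] * num_levels

    def weights(self):
        return softmax(self.logits)

    def mix(self, level_features):
        """Return the weighted sum of equal-length per-level feature vectors."""
        if len(level_features) != len(self.logits):
            raise ValueError("expected one feature vector per level")
        w = self.weights()
        dim = len(level_features[0])
        return [
            sum(w[k] * level_features[k][d] for k in range(len(w)))
            for d in range(dim)
        ]


if __name__ == "__main__":
    phonetic = [1.0, 2.0]   # toy local (phoneme-level) features
    prosodic = [3.0, 4.0]   # toy global (prosody-level) features

    selector = LevelSelector(num_levels=2)
    print(selector.mix([phonetic, prosodic]))  # uniform mix of both levels

    # A block that has learned to favor phonetic features.
    selector.logits = [10.0, -10.0]
    print(selector.mix([phonetic, prosodic]))
```

Under this reading, giving each block its own selector would let some blocks lean on phonetic detail for lip shape while others lean on prosody for expression and head motion, which matches the stated goal of selective injection.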

Problem

Research questions and friction points this paper is trying to address.

co-speech video generation

lip synchronization

expressive facial animation

speech-driven animation

realistic portrait video

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward-free reinforcement learning

multilevel speech representation

flow-matching