Problem
Research questions and friction points this paper is trying to address.
Integrating visual cues for expressive speech generation
Exploring effective multimodal fusion strategies
Improving emotion recognition and expressive dialogue performance
Innovation
Methods, ideas, or system contributions that make the work stand out.
Integrates full-face visual cues into a speech model
Explores visual encoders and multimodal fusion strategies
Fine-tunes the model on emotion recognition and expressive dialogue tasks