🤖 AI Summary
This study examines whether turn-taking prediction models designed for spoken dialogue can be applied to sign language interaction, where modality differences prevent direct reuse. In an initial transfer study, the Voice Activity Projection (VAP) framework is adapted to dyadic sign language conversations from the Public DGS Corpus. Binary signing activity streams are derived from lexical sign annotations, and pose estimation supplies hand, eye-region, and mouth-region features for each signer. A temporal model is then trained on proxy tasks for turn-taking prediction (SHIFT/HOLD), with activity definitions tailored to the visual-gestural nature of sign language. The results give initial evidence that the transfer is feasible: SHIFT/HOLD prediction is promising, and hand cues are particularly informative. Predicting SHIFT events remains difficult, however, which underscores the need for turn-taking event definitions designed specifically for sign language.
📝 Abstract
Social robots must interact robustly not only with the users that speech-centered systems assume but also with diverse users whose communication relies on other modalities, such as sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been used successfully to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial study of transferring a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.
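The step of deriving binary signing activity streams and SHIFT/HOLD events can be illustrated with a minimal sketch. It is not the paper's pipeline: the abstract does not give the frame rate, the annotation format, or the exact event definitions. The sketch assumes each signer's lexical sign annotations are available as `(start_s, end_s)` intervals in seconds. It borrows the convention common in spoken VAP work that a gap of mutual inactivity is a HOLD when the same participant resumes and a SHIFT when the other participant takes over. The names `intervals_to_activity` and `label_gaps` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]   # (start_s, end_s) of one annotated sign
Event = Tuple[int, int, str]     # (gap_start, gap_end_exclusive, label)


def intervals_to_activity(intervals: List[Interval],
                          duration_s: float,
                          frame_hz: float = 25.0) -> List[int]:
    """Rasterize annotated sign intervals into a binary per-frame stream.

    frame_hz is an assumed value; the source does not state a frame rate.
    """
    n_frames = int(round(duration_s * frame_hz))
    activity = [0] * n_frames
    for start_s, end_s in intervals:
        # round() keeps floating-point noise from shifting frame boundaries
        lo = max(0, int(round(start_s * frame_hz)))
        hi = min(n_frames, int(round(end_s * frame_hz)))
        for t in range(lo, hi):
            activity[t] = 1
    return activity


def label_gaps(act_a: List[int], act_b: List[int]) -> List[Event]:
    """Label each gap of mutual inactivity as HOLD or SHIFT.

    A gap is labeled only if exactly one signer is active in the frame
    just before it and exactly one in the frame just after it.
    """
    if len(act_a) != len(act_b):
        raise ValueError("activity streams must have the same length")
    n = len(act_a)
    events: List[Event] = []
    t = 0
    while t < n:
        if act_a[t] or act_b[t]:
            t += 1
            continue
        start = t
        while t < n and not (act_a[t] or act_b[t]):
            t += 1
        end = t
        if start == 0 or end == n:
            continue  # gap touches the recording boundary: no context
        before = (act_a[start - 1], act_b[start - 1])
        after = (act_a[end], act_b[end])
        if sum(before) != 1 or sum(after) != 1:
            continue  # overlapping signing on one side: ambiguous, skip
        events.append((start, end, "HOLD" if before == after else "SHIFT"))
    return events


# Toy example: signer A pauses and resumes, then signer B takes over.
signer_a = intervals_to_activity([(0.0, 1.0), (1.5, 2.0)], 3.0, frame_hz=10)
signer_b = intervals_to_activity([(2.5, 3.0)], 3.0, frame_hz=10)
events = label_gaps(signer_a, signer_b)
```

In the toy example the first gap is labeled HOLD and the second SHIFT. Sign language interaction may not fit this speech-derived scheme, for instance when signers overlap or hold a sign, which is the limitation the abstract points to.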