🤖 AI Summary
This study addresses the lack of high-quality multimodal dialogue datasets for recognizing and analyzing interactions between embodied AI agents and humans. To bridge this gap, the authors introduce DeepSpeak-Agentic, a novel dataset comprising over 37 hours of synchronized audiovisual recordings from semi-structured human–agent conversations, establishing the first multimodal dialogue benchmark specifically designed for embodied AI agents. The work presents a scalable data collection framework that integrates AI agent deployment, crowdworker pairing, automated recording, and signal separation techniques to distinguish human and agent contributions. DeepSpeak-Agentic provides a public, standardized evaluation foundation for research in AI-generated speech and facial animation, human–agent interaction modeling, and automated forensic identification tasks.
📝 Abstract
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.