EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing computer vision research on face-to-face teaching interactions in co-located physical spaces is hindered by data scarcity and insufficient analytical methodologies. Method: We introduce the first first-person, multimodal instructional video dataset—capturing speech, visual cues, and behavioral signals—and establish a novel multimodal benchmark for co-located teaching through fine-grained, dual-dimensional annotations: procedural step segmentation and dialogue state classification. We propose an end-to-end multimodal large language model (MLLM) framework that jointly processes image, audio, and text inputs, and comparatively evaluate it against task-specific models. Results: Our experiments demonstrate that zero-shot MLLMs significantly outperform specialized models on core tasks—without fine-tuning—validating their strong generalization capability for holistic understanding of pedagogical processes. This work establishes a new paradigm for analyzing educational interactions in shared physical environments.

Technology Category

Application Category

📝 Abstract
Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.
Problem

Research questions and friction points this paper is trying to address.

Analyzing face-to-face instructional interactions in computer vision
Addressing dataset scarcity for egocentric instructional video analysis
Benchmarking multimodal models for holistic instructional understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces egocentric video dataset for instructional interactions
Benchmarks multimodal LLMs against task-specific models
Integrates verbal and nonverbal communication modalities
🔎 Similar Papers
No similar papers found.