🤖 AI Summary
This work addresses the challenge that existing video-language models struggle to recognize and correct user errors in real time during procedural tasks such as cooking. To bridge this gap, the authors introduce Ego-MC-Bench, the first benchmark specifically designed for evaluating real-time task-guided error correction, along with Ego-CoMist, a synthetic dataset generated via counterfactual augmentation that transforms standard instructional videos into training samples containing timely corrective interventions. A lightweight video-language model fine-tuned on Ego-CoMist demonstrates significantly improved streaming error-correction performance on Ego-MC-Bench and is suitable for deployment on edge devices. This contribution fills a critical void in both high-quality training data and standardized evaluation methodologies for real-time interactive guidance systems.
📝 Abstract
Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.