🤖 AI Summary
This work addresses the challenge of delivering context-aware, conversational guidance for multi-step instructional videos by jointly leveraging visual content, language, and task planning. The authors propose a multimodal dialogue model that integrates visual inputs, explicit task plans, and user interactions to enable plan-grounded dialogue generation. Its key innovations are a multimodal plan reasoning mechanism and a plan-based cross-modal retrieval framework, which together align hybrid text-image queries with the underlying task structure. Evaluated on a newly curated instructional video dialogue dataset, the model attains over 90% accuracy on plan-aware visual question answering, substantially outperforming state-of-the-art approaches.
📝 Abstract
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual form. We evaluate VIGiA on a novel dataset of rich instructional video dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan-guidance setting, reaching over 90% accuracy on plan-aware VQA.