🤖 AI Summary
This work addresses the challenge of delivering context-aware, conversational guidance for multi-step instructional videos by jointly leveraging visual content, language, and task planning. The authors propose a multimodal dialogue model that integrates visual inputs, explicit task plans, and user interactions to enable plan-grounded dialogue generation. Its key innovations are a multimodal plan reasoning mechanism and a plan-based cross-modal retrieval framework, which together align hybrid text-image queries with the underlying task structure. Evaluated on a newly curated instructional video dialogue dataset, the model attains over 90% accuracy on plan-aware visual question answering, substantially outperforming state-of-the-art approaches.
📝 Abstract
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual form. We evaluate VIGiA on a novel dataset of rich instructional video dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan-guidance setting, reaching over 90% accuracy on plan-aware VQA.