AI Summary
To address the cognitive load and time inefficiency caused by fragmented, unstructured tutorial videos, this paper proposes VideoMix, the first task-centric framework for multi-video collaborative understanding. Leveraging vision-language models (VLMs), VideoMix performs cross-video frame-level semantic parsing, extracts and aligns key instructional elements (materials, steps, alternatives, outcomes), and constructs a task-driven knowledge graph to generate structured multimodal summaries and navigable video segment links. Its novelty lies in the first application of VLMs to knowledge alignment and interpretable summarization across heterogeneous "how-to" videos, enabling outcome-oriented learning path construction. A user study demonstrates that, compared to sequential video viewing, VideoMix improves task comprehension completeness by 42%, reduces learning time by 37%, and significantly enhances information retrieval efficiency.
Abstract
Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.