🤖 AI Summary
To address the limitations of large language models (LLMs) in video understanding, namely constrained reasoning capabilities and heavy reliance on large-scale annotated data, this paper proposes the Video Understanding and Reasoning Framework (VURF), a general-purpose framework for video understanding. VURF introduces an LLM-driven visual programming paradigm: it automatically decomposes complex video tasks into executable visual programs and supports in-context learning from instruction-program pairs. Crucially, it incorporates a function-level error-feedback correction mechanism and an iterative, context-aware self-refinement strategy for the in-context examples, substantially reducing dependence on human annotations. Evaluated across diverse tasks, including video question answering, video anticipation, pose estimation, and multi-video question answering, VURF consistently outperforms existing visual programming approaches, demonstrating superior generalization and robustness.
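To make the generate-then-correct workflow described above concrete, here is a minimal sketch in Python, assuming the LLM is exposed as a plain prompt-to-text callable. The `FUNCTION_REGISTRY`, `generate_program`, `unsupported_calls`, and `correct_with_feedback` names, along with the prompt wording, are illustrative assumptions and not the paper's actual implementation.

```python
from typing import Callable

# Assumed set of supported visual primitives; the real framework's API may differ.
FUNCTION_REGISTRY = {"LOC", "TRACK", "VQA", "EVAL"}

def generate_program(llm: Callable[[str], str], instruction: str, examples: str) -> str:
    """Prompt the LLM with instruction->program pairs to emit a visual program."""
    return llm(f"{examples}\nInstruction: {instruction}\nProgram:")

def unsupported_calls(program: str) -> list[str]:
    """Function-level check: collect called names that are not in the registry."""
    calls = {line.split("(", 1)[0].split("=")[-1].strip()
             for line in program.splitlines() if "(" in line}
    return sorted(calls - FUNCTION_REGISTRY)

def correct_with_feedback(llm: Callable[[str], str], instruction: str,
                          examples: str, program: str, max_rounds: int = 3) -> str:
    """Feed unsupported-function errors back to the LLM and ask for a rewrite."""
    for _ in range(max_rounds):
        bad = unsupported_calls(program)
        if not bad:
            break
        feedback = (f"The functions {bad} are not supported; rewrite the program "
                    f"using only {sorted(FUNCTION_REGISTRY)}.")
        program = llm(f"{examples}\nInstruction: {instruction}\nProgram:\n{program}\n"
                      f"Feedback: {feedback}\nRevised program:")
    return program
```

In this sketch the correction loop only screens for unsupported function names; the framework's feedback mechanism may check more than that, but the overall generate, check, and re-prompt cycle is the idea being illustrated.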
📝 Abstract
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach that extends the utility of LLMs to video tasks, leveraging their capacity to generalize from minimal input and output demonstrations provided in context. We harness this in-context learning capability by presenting LLMs with pairs of instructions and their corresponding high-level programs, from which they generate executable visual programs for video understanding. To enhance the accuracy and robustness of the generated programs, we implement two important strategies. *Firstly*, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs that use unsupported functions. *Secondly*, motivated by recent work on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs with the outputs the LLM would have generated had it not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, demonstrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.
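The second strategy, iterative refinement of the in-context examples, can be pictured with the rough Python sketch below. It again assumes the LLM is a prompt-to-text callable; the `refine_examples` helper and the prompt phrasing are hypothetical stand-ins for the procedure described in the abstract, not the authors' code.

```python
from typing import Callable

def refine_examples(llm: Callable[[str], str],
                    examples: list[tuple[str, str]],
                    instructions: list[str],
                    iterations: int = 2) -> list[tuple[str, str]]:
    """Align constrained generations with unconstrained ones and promote the
    aligned programs to become the next round's in-context examples."""
    for _ in range(iterations):
        block = "\n".join(f"Instruction: {i}\nProgram: {p}" for i, p in examples)
        refined = []
        for instr in instructions:
            # Program produced under the structure imposed by the current examples.
            constrained = llm(f"{block}\nInstruction: {instr}\nProgram:")
            # Program the LLM would produce without that structural constraint.
            unconstrained = llm(f"Instruction: {instr}\nDescribe a step-by-step program:")
            # Ask the LLM to align the two; the aligned version replaces the example.
            aligned = llm("Rewrite the first program so it covers the reasoning of the "
                          "second while keeping the original format.\n"
                          f"First:\n{constrained}\nSecond:\n{unconstrained}\nAligned:")
            refined.append((instr, aligned))
        examples = refined
    return examples
```

Each iteration nudges the example set toward programs the LLM would naturally produce, which is the sense in which the abstract speaks of aligning the initial outputs with the unconstrained ones.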