VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

📅 2024-03-21
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
To address the limitations of large language models (LLMs) in video understanding—namely, constrained reasoning capabilities and heavy reliance on large-scale annotated data—this paper proposes the Video Understanding Reasoning and Self-Refinement Framework (VURF), a general-purpose framework for video understanding. VURF introduces an LLM-driven visual programming paradigm: it automatically decomposes complex video tasks into executable visual programs and enables in-context learning via instruction-program pairing. Crucially, it incorporates a function-level error feedback correction mechanism and an iterative, context-aware example self-refinement strategy, substantially reducing dependence on human annotations. Evaluated across diverse tasks—including video question answering, video prediction, pose estimation, and multi-video question answering—VURF consistently outperforms existing visual programming approaches, demonstrating superior generalization and robustness.
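The instruction-program pairing described above can be sketched as a simple in-context prompting loop: demonstrations of (instruction, high-level program) pairs are concatenated into a prompt, and the LLM completes the program for a new instruction. This is only an illustrative reconstruction; the example programs, function names, and prompt format below are assumptions, not the paper's actual API.

```python
# Hypothetical (instruction, program) demonstrations in the style of
# visual-programming frameworks; not taken from the paper itself.
IN_CONTEXT_EXAMPLES = [
    ("How many people appear in the video?",
     "FRAMES=sample_frames(video)\nBOXES=detect(FRAMES, 'person')\nANSWER=count(BOXES)"),
    ("What color is the car?",
     "FRAMES=sample_frames(video)\nREGION=locate(FRAMES, 'car')\nANSWER=query_attribute(REGION, 'color')"),
]

def build_prompt(instruction, examples=IN_CONTEXT_EXAMPLES):
    """Pair each demonstration instruction with its high-level program,
    then append the new instruction for the LLM to complete."""
    parts = []
    for ins, prog in examples:
        parts.append(f"Instruction: {ins}\nProgram:\n{prog}\n")
    parts.append(f"Instruction: {instruction}\nProgram:\n")
    return "\n".join(parts)

def generate_program(instruction, llm):
    """`llm` is any callable mapping a prompt string to completion text
    (e.g., a wrapper around a GPT-3.5 chat call)."""
    return llm(build_prompt(instruction)).strip()
```

Passing the LLM as a plain callable keeps the sketch independent of any particular API client; swapping in a real model only requires changing that one argument.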

📝 Abstract
Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. We harness their contextual learning capabilities by presenting LLMs with pairs of instructions and their corresponding high-level programs to generate executable visual programs for video understanding. To enhance the program's accuracy and robustness, we implement two important strategies. *Firstly,* we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs utilizing unsupported functions. *Secondly,* taking motivation from recent works on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, illustrate these enhancements' efficacy in improving the performance of visual programming approaches for video tasks.
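The first strategy, function-level error feedback, amounts to checking the generated program against the set of supported functions and re-prompting the LLM with a corrective message when unsupported calls appear. A minimal sketch of that check, assuming an illustrative supported-function set and a simple regex-based call scan (not the paper's actual implementation):

```python
import re

# Illustrative API surface; the real framework's function set may differ.
SUPPORTED_FUNCTIONS = {"sample_frames", "detect", "count", "locate", "query_attribute"}

def find_unsupported_calls(program, supported=SUPPORTED_FUNCTIONS):
    """Scan a generated program for calls to functions outside the supported API."""
    called = set(re.findall(r"([A-Za-z_]\w*)\s*\(", program))
    return sorted(called - supported)

def feedback_prompt(program, bad_calls):
    """Build a corrective prompt asking the LLM (GPT-3.5 in the paper) to
    rewrite the program using only supported functions."""
    return (f"The program below calls unsupported functions: {', '.join(bad_calls)}.\n"
            f"Rewrite it using only: {', '.join(sorted(SUPPORTED_FUNCTIONS))}.\n\n"
            f"{program}")
```

In use, a program whose call set is a subset of the supported API passes through unchanged; otherwise the feedback prompt is sent back to the LLM for one more generation round.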
Problem

Research questions and friction points this paper is trying to address.

LLM-based visual programming has proven effective for images but remains underexplored for video
LLMs show constrained reasoning over video content and rely heavily on large-scale annotated data
Generated programs can call unsupported functions or inherit flaws from the in-context examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends LLM-driven visual programming from image tasks to video understanding
Uses GPT-3.5-generated feedback to correct programs that call unsupported functions
Iteratively self-refines in-context examples to improve program quality
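The self-refinement idea above can be sketched as a loop that asks the LLM what program it would produce for each example instruction *without* the constraint of the in-context demonstrations, and updates the stored example when the free-form output differs. This is a crude, hypothetical rendering of the iterative procedure, assuming `llm` is a prompt-to-text callable; the real alignment criterion in the paper is richer than a plain string comparison.

```python
def refine_examples(examples, llm, rounds=2):
    """Iteratively improve (instruction, program) in-context examples.

    For each example, query the LLM for an unconstrained program and adopt
    it when it differs from the stored one; repeat for a few rounds until
    the examples stabilize. All prompts here are illustrative.
    """
    for _ in range(rounds):
        updated = []
        for instruction, program in examples:
            free_form = llm(f"Write a program for: {instruction}").strip()
            updated.append((instruction, free_form if free_form != program else program))
        examples = updated
    return examples
```

Running the loop for a small fixed number of rounds is a design choice here; a convergence check (no example changed in a round) would be a natural stopping rule instead.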
Ahmad Mahmood
ETH Zurich
Ashmal Vayani
University of Central Florida
Computer Vision, MultiModality, Large Language Models, Responsible AI
Muzammal Naseer
Asst. Professor, Khalifa University
Multi-modal Learning, AI Safety and Reliability
Salman H. Khan
Mohamed bin Zayed University of Artificial Intelligence, Australian National University
F. Khan
Mohamed bin Zayed University of Artificial Intelligence, Linköping University