Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting

📅 2024-10-01
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address catastrophic forgetting in large language models (LLMs) during continual learning for video question answering (VideoQA), this paper proposes ColPro, a collaborative prompting framework. ColPro is the first to systematically model the interaction among textual understanding, visual perception, and temporal dynamics in this setting. It integrates three complementary prompt modules: question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. The framework further incorporates frame-level visual feature encoding, temporal attention, and dedicated continual learning strategies to mitigate forgetting. Evaluated on NExT-QA and DramaQA, ColPro achieves state-of-the-art accuracy of 55.14% and 71.24%, respectively, outperforming existing continual learning approaches. This work establishes a new paradigm for continual LLM-based reasoning over evolving video content.

📝 Abstract
In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) on a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro outperforms existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
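The paper's exact architecture is not given on this page, but the idea of combining the three prompt types can be illustrated with a toy sketch: learnable prompt embeddings for each module are prepended to the visual and textual token embeddings before the sequence reaches a frozen LLM. All names, sizes, and dimensions below are hypothetical, chosen only for illustration.

```python
import random

random.seed(0)
D = 8  # toy embedding dimension (hypothetical; real models use hundreds+)

def toy_embs(n, d=D):
    """Stand-in for learned embeddings: n random d-dimensional vectors."""
    return [[random.random() for _ in range(d)] for _ in range(n)]

# Three prompt groups, mirroring ColPro's modules (sizes are hypothetical)
question_prompts = toy_embs(2)   # question constraint prompting
knowledge_prompts = toy_embs(2)  # knowledge acquisition prompting
vistemp_prompts = toy_embs(2)    # visual temporal awareness prompting

def collaborative_input(frame_embs, question_embs):
    """Prepend the three prompt groups to the visual and textual tokens,
    forming the sequence that would be fed to a frozen LLM (sketch only)."""
    return (question_prompts + knowledge_prompts + vistemp_prompts
            + frame_embs + question_embs)

seq = collaborative_input(toy_embs(4), toy_embs(5))
print(len(seq))  # 15 tokens: 6 prompt + 4 frame + 5 question embeddings
```

In prompt-based continual learning, only these small prompt pools are updated per task while the LLM stays frozen, which is what limits catastrophic forgetting.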
Problem

Research questions and friction points this paper is trying to address.

Continual Learning
Language Models
Video Question Answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Prompts
Continual Learning
Video Question Answering
👥 Authors
Chen Cai (Nanyang Technological University)
Zheng Wang (Nanyang Technological University)
Jianjun Gao (Nanyang Technological University)
Wenyang Liu (Nanyang Technological University)
Ye Lu (Nanyang Technological University)
Runzhong Zhang (Nanyang Technological University)
Kim-Hui Yap (Nanyang Technological University)