RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning

📅 2024-05-11
📈 Citations: 2
Influential: 1
🤖 AI Summary
For zero-shot video captioning—i.e., generating descriptive captions without any training data or model fine-tuning—this paper proposes a test-time adaptation (TTA) framework. The method synergistically leverages frozen multimodal models—XCLIP (video–text), CLIP (image–text), AnglE (text embedding), and GPT-2 (text generation)—augmented by learnable, retrievable tokens. Its core contribution is the first retrieval-augmented TTA mechanism for video captioning, integrating unsupervised, soft-target-guided token optimization that achieves video-aware alignment in just 16 optimization steps. Crucially, no labeled data is required and all model weights stay frozen; only the lightweight tokens are optimized. Evaluated on MSR-VTT, MSVD, and VATEX, the approach improves state-of-the-art zero-shot CIDEr scores by an absolute 5.1%–32.4% over prior methods, demonstrating significant gains in caption quality while preserving full zero-shot operability.

📝 Abstract
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, chosen for their source-code availability. The main challenge is how to make the text generation model sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among the four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional approach of training these tokens on training data, we learn them from soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This procedure can be completed efficiently in just a few iterations (we use 16 iterations in the experiments) and does not require ground-truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
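The core mechanism described above—optimizing a small set of learnable tokens toward a soft target produced by frozen encoders, for only 16 steps and with no weight updates—can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the embedding dimension, the random "video embedding" standing in for a frozen XCLIP output, the plain MSE loss, and all variable names are assumptions for demonstration.

```python
import numpy as np

# Hypothetical sketch of RETTA-style test-time token optimization:
# a learnable token vector is nudged toward a frozen "video embedding"
# (the soft target) for 16 gradient steps. No model weights are touched;
# the tokens are the only learnable parameters.

rng = np.random.default_rng(0)
dim = 64

video_emb = rng.normal(size=dim)
video_emb /= np.linalg.norm(video_emb)      # frozen soft target (unit norm)

tokens = rng.normal(scale=0.02, size=dim)   # the only learnable parameters
lr = 0.25

for step in range(16):                      # the paper reports 16 iterations suffice
    grad = 2.0 * (tokens - video_emb)       # analytic gradient of ||t - v||^2
    tokens -= lr * grad                     # plain gradient descent on the tokens

alignment = tokens @ video_emb / np.linalg.norm(tokens)
print(f"cosine(token, soft target) after 16 steps: {alignment:.4f}")
```

In the actual method the loss combines several carefully crafted objectives across XCLIP, CLIP, and AnglE rather than a single MSE term, and the optimized tokens are then consumed by GPT-2 as a communication medium for caption generation.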
Problem

Research questions and friction points this paper is trying to address.

Zero-shot Learning
Video Captioning
Computer Vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot Video Captioning
Transfer Learning
Large Language Models Integration