🤖 AI Summary
Existing vision-language models (VLMs) generate verbose, inefficient video summaries that degrade downstream task performance and inflate human supervision costs; moreover, mainstream evaluation relies on costly manual annotations and neglects a summary's utility in real-world tasks. This paper proposes the first annotation-free, task-aware framework for video summarization evaluation, introducing the information bottleneck principle to video-language assessment to jointly optimize content fidelity and downstream decision utility. The method combines vision-language alignment modeling, task-oriented utility scoring, ranking over randomly sampled VLM outputs, and a cross-domain human evaluation protocol. Evaluated on three real-world datasets, the framework improves task accuracy by up to 61.23% and reduces response latency by 75.77% over raw VLM outputs, outperforming even direct human video browsing.
📝 Abstract
Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries' utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance, boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared with naive VLM summaries or raw video.
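The selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `grounding_score` and `utility_score` below are hypothetical keyword-overlap stand-ins for VIBE's actual alignment and task-informativeness metrics, and the combined score is assumed to be a simple sum.

```python
# Hypothetical sketch of VIBE-style selection over sampled VLM summaries.
# The two scoring functions are placeholder heuristics, NOT the paper's
# grounding/utility metrics.

def grounding_score(summary: str, video_features: list[str]) -> float:
    # Placeholder: fraction of (assumed) visual elements mentioned in the text.
    # Real VIBE measures vision-language alignment against the video itself.
    return sum(f in summary for f in video_features) / len(video_features)

def utility_score(summary: str, task_keywords: list[str]) -> float:
    # Placeholder: fraction of task-relevant keywords covered.
    # Real VIBE scores task-oriented informativeness.
    return sum(k in summary for k in task_keywords) / len(task_keywords)

def vibe_select(candidates: list[str],
                video_features: list[str],
                task_keywords: list[str]) -> str:
    """Rank randomly sampled VLM summaries by grounding + utility,
    return the top-ranked one."""
    return max(
        candidates,
        key=lambda s: grounding_score(s, video_features)
                      + utility_score(s, task_keywords),
    )

# Toy example in the dashcam-review setting from the abstract:
candidates = [
    "A car runs a red light at the intersection.",
    "Traffic flows through a busy city street.",
]
best = vibe_select(
    candidates,
    video_features=["car", "red light", "intersection"],
    task_keywords=["red light"],
)
```

The key design point is that no human annotation enters the loop: both scores are computed automatically, so ranking sampled candidates is the only supervision-free lever needed to pick a summary that serves the downstream task.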