A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines positional bias in multimodal large language models (MLLMs) performing multi-video summarization: the quality of a video's summary can change with the slot the video occupies in the input, even when its content is unchanged. The work quantifies this effect systematically, building a benchmark from ActivityNet and news videos (Cooking, Domestic, Leisure, and News settings, with two- and four-video inputs) and evaluating nine open-source and proprietary MLLMs with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Positional effects are domain- and model-dependent. Signed directional bias can be small even when videos in middle positions are summarized worse than those at the edges, and a larger visual or generation budget does not uniformly remove the imbalance. Further experiments indicate that prompt engineering can partially mitigate, but not eliminate, the issue.

📝 Abstract

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
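The abstract's observation that signed directional bias can be small while middle positions still underperform is easier to see with numbers. The sketch below is illustrative only: the abstract does not give formulas for DPB or MEG, so `directional_bias` (early-slot mean minus late-slot mean) and `middle_edge_gap` (edge-slot mean minus middle-slot mean) are assumed definitions, and the function names and per-slot scores are hypothetical.

```python
from statistics import mean


def per_position_means(trials):
    """Average a per-slot quality score over trials.

    `trials` is a list of runs; each run is a list holding one score per
    input slot (slot 0 = first video in the prompt). Hypothetical layout.
    """
    n_slots = len(trials[0])
    return [mean(run[i] for run in trials) for i in range(n_slots)]


def directional_bias(pos_means):
    """ASSUMED definition: mean of the early half minus mean of the late half.

    Positive values favor early slots, negative values favor late slots.
    """
    half = len(pos_means) // 2
    return mean(pos_means[:half]) - mean(pos_means[-half:])


def middle_edge_gap(pos_means):
    """ASSUMED definition: mean of the two edge slots minus mean of the rest."""
    if len(pos_means) < 3:
        raise ValueError("a middle-edge gap needs at least three slots")
    edges = [pos_means[0], pos_means[-1]]
    middle = pos_means[1:-1]
    return mean(edges) - mean(middle)


# Made-up scores for a four-video input: edges score 0.8, middle slots 0.6.
trials = [
    [0.8, 0.6, 0.6, 0.8],
    [0.8, 0.6, 0.6, 0.8],
]
pos = per_position_means(trials)
print("directional bias:", directional_bias(pos))
print("middle-edge gap:", round(middle_edge_gap(pos), 3))
```

With these made-up scores the early and late halves average the same, so the signed bias is zero, yet the edge slots beat the middle slots by about 0.2. A signed first-versus-last measure alone would miss that pattern, which is one reason a separate middle-versus-edge measure is useful. Note that the gap is undefined for two-video inputs, which have no middle slot.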

Problem

Research questions and friction points this paper is trying to address.

positional bias

multi-video summarization

multimodal large language models

input order sensitivity

video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

positional bias

multi-video summarization

multimodal large language models