NeMo: Needle in a Montage for Video-Language Understanding

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video large language models (VideoLLMs) lack benchmarks that rigorously evaluate their long-horizon reasoning capabilities, particularly long-context recall and precise temporal localization. To address this gap, we propose "Needle in a Montage" (NeMo), the first adaptation of the needle-in-a-haystack paradigm to video-language understanding. NeMo introduces a scalable, automated data synthesis framework that generates high-quality, temporally grounded question-answer (QA) pairs for videos of varying durations. Leveraging this framework, we construct and publicly release NeMoBench, a large-scale video QA benchmark comprising 31,378 QA instances. Comprehensive evaluation of 20 state-of-the-art VideoLLMs reveals substantial limitations in temporal reasoning, especially under extended contexts, highlighting critical bottlenecks in current architectures. NeMoBench establishes a reproducible, extensible, and continuously updatable evaluation platform to advance research in video-language reasoning.

📝 Abstract
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle-in-a-haystack test widely used to evaluate LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable, automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, the full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.
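
The abstract describes the data synthesis pipeline only at a high level. As a rough, hypothetical illustration of the general needle-in-a-haystack construction adapted to video, a QA instance might be assembled as in the sketch below; `Clip`, `build_instance`, and the sample question are illustrative assumptions, not the paper's actual pipeline or API.

```python
# Hypothetical sketch (not the paper's actual pipeline): build one
# "needle in a montage" QA instance by splicing a needle clip into a
# sequence of distractor clips and recording its ground-truth time span.
import random
from dataclasses import dataclass

@dataclass
class Clip:
    video_id: str
    duration: float  # seconds

def build_instance(distractors: list[Clip], needle: Clip,
                   question: str, answer: str) -> dict:
    """Insert the needle at a random position and compute its
    start/end timestamps within the concatenated montage."""
    pos = random.randint(0, len(distractors))
    montage = distractors[:pos] + [needle] + distractors[pos:]
    start = sum(c.duration for c in montage[:pos])
    return {
        "montage": [c.video_id for c in montage],
        "needle_span": (start, start + needle.duration),  # grounding target
        "question": question,
        "answer": answer,
    }

if __name__ == "__main__":
    haystack = [Clip(f"clip_{i}", 30.0) for i in range(8)]
    needle = Clip("needle_clip", 12.0)
    qa = build_instance(haystack, needle,
                        question="When does the event shown in the needle occur?",
                        answer="Within the recorded needle_span.")
    print(qa["needle_span"], qa["montage"])
```

Because the needle position and the distractor count are free parameters, the same construction scales from short montages to hour-long haystacks, which is what enables benchmarking across durations.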
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex temporal reasoning in video-language models
Assessing long-context recall and temporal grounding capabilities
Automating scalable video QA data generation for benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable automated pipeline that synthesizes temporally grounded video QA data
Needle in a Montage task probing long-context recall and temporal grounding (a scoring sketch follows this list)
NeMoBench benchmark of 31,378 QA pairs used to evaluate 20 state-of-the-art VideoLLMs
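
The page does not state how NeMoBench scores temporal grounding. The sketch below uses temporal IoU aggregated as Recall@0.5, a standard choice in grounding benchmarks; it is an assumption for illustration, not necessarily NeMoBench's protocol.

```python
# Assumed scoring sketch (not necessarily NeMoBench's protocol):
# temporal IoU between predicted and ground-truth [start, end] spans,
# aggregated as Recall@0.5 over all QA instances.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold: float = 0.5) -> float:
    """Fraction of instances whose predicted span reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# One miss and one correct localization -> recall 0.5
print(recall_at_iou([(95.0, 110.0), (10.0, 22.0)],
                    [(40.0, 55.0), (12.0, 24.0)]))
```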
👥 Authors

Zi-Yuan Hu
The Chinese University of Hong Kong
Multimodal Learning · Natural Language Processing · Parameter-Efficient Tuning

Shuo Liang
The Chinese University of Hong Kong

Duo Zheng
The Chinese University of Hong Kong
Computer Vision

Yanyang Li
The Chinese University of Hong Kong
Natural Language Processing

Yeyao Tao
The Chinese University of Hong Kong

Shijia Huang
The Chinese University of Hong Kong

Wei Feng
Phoenix TV

Jia Qin
Phoenix TV

Jianguang Yu
Phoenix TV

Jing Huang
Stanford University

Meng Fang
University of Liverpool
Natural Language Processing · Reinforcement Learning · Agents · Artificial Intelligence

Yin Li
University of Wisconsin-Madison

Liwei Wang
The Chinese University of Hong Kong