StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios

📅 2025-12-03
🤖 AI Summary
This work addresses the lack of evaluation frameworks for multimodal large language models’ (MLLMs) real-time understanding of continuous video streams in embodied AI. We introduce StreamEQA, the first streaming video question-answering benchmark tailored to embodied scenarios. Built upon 156 long videos, it comprises 42 tasks spanning perception, interaction, and planning—organized along three time-sensitive reasoning modes: forward, real-time, and backward. A hybrid pipeline combining automated generation with human refinement yields ~21K temporally grounded QA pairs with precise timestamps. Evaluations across 13 state-of-the-art video foundation models reveal strong performance on conventional benchmarks but substantial degradation on StreamEQA, exposing fundamental limitations in dynamic environment modeling and cross-temporal embodied reasoning. StreamEQA establishes the first dual-dimensional evaluation framework—incorporating both embodiment and streaming properties—thereby providing a critical benchmark and new research direction for real-world embodied intelligence.

📝 Abstract
As embodied intelligence advances toward real-world deployment, the ability to continuously perceive and reason over streaming visual inputs becomes essential. In such settings, an agent must maintain situational awareness of its environment, comprehend interactions with surrounding entities, and dynamically plan actions informed by past observations, the current context, and anticipated future events. To facilitate progress in this direction, we introduce StreamEQA, the first benchmark designed for streaming video question answering in embodied scenarios. StreamEQA evaluates existing MLLMs along two orthogonal dimensions: Embodied and Streaming. Along the embodied dimension, we categorize the questions into three levels: perception, interaction, and planning, which progressively assess a model's ability to recognize fine-grained visual details, reason about agent-object interactions, and perform high-level goal-directed reasoning. For the streaming dimension, questions are divided into backward, real-time, and forward reasoning, with each mode relying on a distinct temporal context. Built upon 156 independent long videos, StreamEQA defines 42 tasks and generates approximately 21K question-answer pairs with precise timestamps through a hybrid pipeline combining automated generation and human refinement. Evaluations of 13 state-of-the-art video-LLMs reveal that, despite strong performance on conventional benchmarks, these models still struggle with streaming video understanding in embodied scenarios. We hope StreamEQA will catalyze research on streaming video understanding for embodied applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluates models on streaming video understanding in embodied scenarios
Assesses perception, interaction, and planning in dynamic environments
Tests backward, real-time, and forward reasoning with temporal contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

StreamEQA benchmark for streaming video QA
Categorizes questions by embodied and streaming dimensions
Uses hybrid pipeline for generating question-answer pairs
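The dual-dimension taxonomy described above (three embodied levels crossed with three streaming modes, each QA pair anchored to a timestamp) can be sketched as a minimal data model. This is a hypothetical illustration only; the field names and example values are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Assumed labels taken from the paper's described taxonomy; the 3x3 grid
# yields nine question categories. Schema details are illustrative.
EMBODIED_LEVELS = ("perception", "interaction", "planning")
STREAMING_MODES = ("backward", "real-time", "forward")


@dataclass
class StreamQA:
    """One timestamped QA pair labeled along both evaluation dimensions."""
    question: str
    answer: str
    timestamp_s: float   # point in the video stream where the question is issued
    embodied_level: str  # one of EMBODIED_LEVELS
    streaming_mode: str  # one of STREAMING_MODES

    def __post_init__(self) -> None:
        # Validate that the labels fall inside the taxonomy.
        assert self.embodied_level in EMBODIED_LEVELS
        assert self.streaming_mode in STREAMING_MODES


# Hypothetical example: a real-time interaction question asked 42.5 s
# into the stream (not an actual item from the benchmark).
qa = StreamQA(
    question="What object is the agent currently holding?",
    answer="a red mug",
    timestamp_s=42.5,
    embodied_level="interaction",
    streaming_mode="real-time",
)
```

The crossing of the two label sets gives nine category cells; the paper's 42 tasks would be distributed across such cells, each task producing timestamped QA pairs.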
Authors
Yifei Wang (School of Computer Science and Technology, East China Normal University)
Zhenkai Li (School of Computer Science and Technology, East China Normal University)
Tianwen Qian (East China Normal University)
Huanran Zheng (East China Normal University)
Zheng Wang (Zhejiang University of Technology)
Yuqian Fu (INSAIT, Sofia University "St. Kliment Ohridski")
Xiaoling Wang (School of Computer Science and Technology, East China Normal University)