Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

📅 2025-12-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Accurately predicting the behavior of multiple pedestrians in dense crowd environments remains a critical challenge for mobile robots, yet existing methods predominantly adopt an egocentric, single-agent perspective and cannot model third-person multi-agent behaviors or their interactions with the scene. Method: We propose CAMP-VLM, the first context-aware multi-agent behavior prediction framework, which integrates vision-language models with scene-graph-based spatial reasoning. We further introduce the first synthetic data generation and evaluation paradigm designed specifically for third-person multi-agent behavior prediction. Contribution/Results: Leveraging photorealistic simulation, supervised fine-tuning (SFT), and direct preference optimization (DPO), CAMP-VLM improves prediction accuracy over the best state-of-the-art baseline by up to 66.9% on both synthetic and real-world sequences, demonstrating strong generalization and practical applicability.

๐Ÿ“ Abstract
Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we fine-tune CAMP-VLM on synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
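The abstract describes conditioning a VLM on both the camera image and a scene graph of spatial relations. The paper does not publish its prompt format, so the following is only a minimal sketch of how such a scene graph could be serialized into a text query for a VLM; all names (SceneGraph, build_prompt, the relation vocabulary, the prediction horizon) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch (assumed names, not the authors' code): serializing a scene
# graph into a text prompt that accompanies the camera frame sent to a VLM.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: nodes are humans/objects, edges are spatial relations."""
    nodes: list = field(default_factory=list)   # e.g. ["person_1", "person_2", "bench"]
    edges: list = field(default_factory=list)   # (subject, relation, object) triples


def serialize_scene_graph(graph: SceneGraph) -> str:
    """Flatten the graph into natural-language facts the model can condition on."""
    facts = [f"{s} is {rel} {o}." for s, rel, o in graph.edges]
    return " ".join(facts)


def build_prompt(graph: SceneGraph, horizon_s: float = 3.0) -> str:
    """Compose the behavior-prediction query: scene context plus task instruction."""
    context = serialize_scene_graph(graph)
    return (
        "You observe a scene from a third-person camera.\n"
        f"Scene graph: {context}\n"
        f"For each person, predict the most likely behavior over the next {horizon_s:.0f} s "
        "(e.g. walking toward an object, sitting down, talking to another person)."
    )


if __name__ == "__main__":
    graph = SceneGraph(
        nodes=["person_1", "person_2", "bench"],
        edges=[("person_1", "walking toward", "bench"),
               ("person_2", "standing near", "person_1")],
    )
    # The resulting text would be sent to a VLM together with the camera frame.
    print(build_prompt(graph))
```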
Problem

Research questions and friction points this paper is trying to address.

Predicts multi-human behaviors from a third-person view for robots
Uses vision-language models with scene context to predict interactions
Addresses the dataset gap via synthetic-data fine-tuning and evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Language Model combined with scene graphs for multi-human prediction
Fine-tuning on synthetic data generated by a photorealistic simulator
Supervised Fine-Tuning and Direct Preference Optimization to improve accuracy (see the data-preparation sketch after this list)
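The two-stage recipe (SFT followed by DPO) implies two kinds of training records: prompt-completion pairs for SFT and chosen/rejected preference pairs for DPO. The sketch below shows one plausible way to assemble such records from simulator rollouts; the field names follow the common prompt/chosen/rejected convention used by preference-optimization toolkits, and the helper functions and file name are hypothetical, not taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's pipeline): building SFT and DPO
# training records from ground-truth behaviors produced by a simulator.
import json


def make_sft_record(prompt: str, ground_truth_behavior: str) -> dict:
    """Supervised fine-tuning pair: prediction prompt -> reference behavior."""
    return {"prompt": prompt, "completion": ground_truth_behavior}


def make_dpo_record(prompt: str, correct: str, incorrect: str) -> dict:
    """Preference pair: the correct prediction is 'chosen', a wrong one is 'rejected'."""
    return {"prompt": prompt, "chosen": correct, "rejected": incorrect}


if __name__ == "__main__":
    prompt = ("Scene graph: person_1 is walking toward bench. "
              "Predict person_1's next behavior.")
    sft = make_sft_record(prompt, "person_1 will sit on the bench.")
    dpo = make_dpo_record(
        prompt,
        correct="person_1 will sit on the bench.",
        incorrect="person_1 will leave the scene.",
    )
    # Write one JSON record per line, a format most fine-tuning tools accept.
    with open("camp_vlm_toy_data.jsonl", "w") as f:
        f.write(json.dumps(sft) + "\n")
        f.write(json.dumps(dpo) + "\n")
```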
🔎 Similar Papers
No similar papers found.