🤖 AI Summary
Existing VideoQA datasets suffer from geographic bias, viewpoint bias, and expert-driven annotation, which hinders modeling of the diverse, user-generated road-event narratives prevalent on global social media. To address this, we introduce RoadSocial, a large-scale VideoQA benchmark built from social-media road-event videos spanning 23 countries, varied camera viewpoints (CCTV, handheld, drone), and 12 challenging question-answering tasks. A scalable, social-comment-driven semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate high-quality QA pairs, yielding 13.2K videos, 260K QA pairs, and 674 fine-grained semantic tags. We benchmark 18 state-of-the-art Video LLMs (open-source and proprietary, driving-specific and general-purpose) and demonstrate RoadSocial's utility in improving the road-event understanding capabilities of general-purpose Video LLMs.
📝 Abstract
We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias, and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones), and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags, and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial's utility in improving the road event understanding capabilities of general-purpose Video LLMs.
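To make the comment-driven annotation idea concrete, below is a minimal Python sketch of how a Text LLM and a Video LLM could be combined to draft and then filter QA pairs from a clip's social comments. The record fields and the `text_llm` / `video_llm_verify` callables are hypothetical placeholders for illustration only; they are not the paper's actual pipeline, prompts, or data schema.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical record types; field names are illustrative, not the RoadSocial schema.
@dataclass
class RoadEventClip:
    video_path: str
    comments: List[str]   # social-media discussion attached to the clip

@dataclass
class QAPair:
    question: str
    answer: str
    task: str              # e.g., one of the 12 QA task categories

def generate_qa_pairs(
    clip: RoadEventClip,
    text_llm: Callable[[str], List[QAPair]],
    video_llm_verify: Callable[[str, QAPair], bool],
) -> List[QAPair]:
    """Sketch of a comment-driven, semi-automatic annotation step:
    a Text LLM drafts QA pairs from the social discourse, and a
    Video LLM keeps only those it can ground in the video itself."""
    prompt = (
        "Given these viewer comments about a road event, write question-answer "
        "pairs covering description, causal reasoning, and outcome:\n"
        + "\n".join(clip.comments)
    )
    candidates = text_llm(prompt)                      # draft QAs from comments
    return [qa for qa in candidates
            if video_llm_verify(clip.video_path, qa)]  # keep video-grounded QAs
```

The two-stage structure (text-side drafting, video-side grounding) is one plausible reading of how the Text and Video LLMs are used "synergistically"; the actual prompts, task taxonomy, and filtering criteria are defined in the paper.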