🤖 AI Summary
This work addresses the scarcity of globally representative, temporally continuous, unedited urban driving data by introducing the CROWD dataset. Manually curated from publicly available YouTube videos, it comprises front-facing dashcam segments drawn from 42,032 videos covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents. Crashes, crash aftermath, and other edited or incident-focused content are explicitly excluded, leaving minute-scale, temporally contiguous footage of routine city driving. Each segment carries manual labels for time of day (day or night) and vehicle type. Machine-generated detections for all 80 MS-COCO classes, produced with YOLOv11x, and segment-local BoT-SORT tracks are provided as per-segment CSV files. In total, the release contains 51,753 segment records spanning 20,275.56 hours. Because CROWD is distributed as video identifiers with segment boundaries and derived annotations rather than as the videos themselves, it supports reproducible research and lowers the barrier for benchmarking cross-domain robustness and interaction analysis.
📝 Abstract
We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute-scale, temporally contiguous, unedited, front-facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment-level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes (e.g. person, bicycle, motorcycle, car, bus, truck, traffic light and stop sign) produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT). CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.
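As a rough illustration of how the per-segment detection and tracking CSVs might be consumed, the sketch below counts distinct tracked objects per class within one segment. The column names (`frame`, `track_id`, `class_name`, `confidence`, `x1`..`y2`), the inline sample rows, and the helper `unique_tracks_per_class` are all hypothetical, since the abstract does not specify the CSV schema. Check the released files for the actual layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column layout and made-up rows; the real CROWD schema may differ.
SAMPLE_CSV = """frame,track_id,class_name,confidence,x1,y1,x2,y2
0,1,car,0.91,100,200,180,260
0,2,person,0.84,300,210,320,270
1,1,car,0.93,102,201,182,261
1,3,car,0.40,400,220,450,260
2,2,person,0.80,305,211,325,271
"""


def unique_tracks_per_class(csv_file, min_confidence=0.5):
    """Count distinct track IDs per class, skipping low-confidence detections."""
    tracks = defaultdict(set)
    for row in csv.DictReader(csv_file):
        if float(row["confidence"]) >= min_confidence:
            tracks[row["class_name"]].add(row["track_id"])
    return {name: len(ids) for name, ids in tracks.items()}


# In practice csv_file would be an open per-segment CSV; here it is an in-memory sample.
counts = unique_tracks_per_class(io.StringIO(SAMPLE_CSV))
print(counts)
```

Because the tracks are segment-local, a track ID is only meaningful within its own segment. Counts like these should therefore be computed per segment and then summed, not merged by ID across segments.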