🤖 AI Summary
This work addresses two cross-view perception challenges in urban traffic: aligning object identities between street-level and aerial views, and localizing objects in bird’s-eye view from a monocular street-level camera. The authors introduce a dataset of synchronized first-person bicycle videos and drone-captured aerial footage that provides identity-level alignment across these extreme viewpoints. The benchmark combines synchronized multi-view acquisition with trajectory-level annotations and includes baselines for both tasks: wedge-based cross-view identity matching, and bird’s-eye-view prediction from monocular images under aerial supervision using inverse perspective mapping, a MonoLayout-style learned model, and a regression model. Experiments show strong recall in cross-view matching, though performance is limited by over-assignment and temporal inconsistency; monocular prediction benefits from aerial supervision but remains far from saturated under lightweight sensing. The accompanying standardized evaluation protocol, annotation toolkit, and baseline implementations aim to advance research in cross-view urban traffic understanding.
📝 Abstract
We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized egocentric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird's-eye-view (BEV) prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, reporting cross-view ID precision, recall, and IDF1, near-far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.
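To make the link between strong recall and over-assignment concrete, the following is a minimal sketch of pair-level precision, recall, and F1 for cross-view matches. It is not the benchmark's evaluation protocol: the abstract does not define how its ID precision/recall/IDF1 are computed, so the function name `pair_level_scores`, the `(street_track_id, drone_track_id)` pair format, and the toy IDs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical pair format: (street_track_id, drone_track_id).
Pair = Tuple[str, str]


def pair_level_scores(predicted: Iterable[Pair],
                      ground_truth: Iterable[Pair]) -> Dict[str, float]:
    """Pair-level precision, recall, and F1 over cross-view matches.

    A simplified analogue of ID precision/recall/IDF1; the benchmark's
    actual definitions may differ (assumption).
    """
    pred: Set[Pair] = set(predicted)
    gt: Set[Pair] = set(ground_truth)
    true_pos = len(pred & gt)  # predicted pairs that are correct

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: both true pairs are found, but two drone tracks are each
# assigned to an extra street track (over-assignment).
gt_pairs = [("s1", "d1"), ("s2", "d2")]
pred_pairs = [("s1", "d1"), ("s2", "d2"), ("s3", "d1"), ("s4", "d2")]
scores = pair_level_scores(pred_pairs, gt_pairs)
print(scores)
```

In this toy case every true pair is recovered, so recall is perfect, while the two extra assignments halve precision and pull the F1 score down with it. That is the pattern the abstract describes for the matching baseline.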