🤖 AI Summary
This work addresses the lack of robustness of camera-based bird’s-eye view (BEV) detection models for autonomous driving under natural semantic perturbations, namely geometric transformations, colour shifts, and motion blur. We propose the first black-box semantic robustness evaluation framework designed specifically for BEV detection. The framework adversarially optimizes these perturbations using two components: a smoothed, distance-based surrogate function that stands in for the non-differentiable mean Average Precision (mAP) metric, and SimpleDIRECT, a deterministic optimization algorithm that uses observed slopes to guide the search. Together they allow the perturbation parameters to be searched jointly. A benchmark of ten recent BEV models finds PolarFormer to be the most robust, while BEVDet is fully compromised, with its precision reduced to zero. Comparisons against random perturbations and two optimization-based baselines demonstrate the effectiveness of the approach. The framework offers a reproducible basis for evaluating semantic robustness in BEV perception systems.
📝 Abstract
Camera-based Bird's Eye View (BEV) perception models are receiving increasing attention for their crucial role in autonomous driving, a domain where concerns about the robustness and reliability of deep learning have been raised. Only a few works have investigated the effects of randomly generated semantic perturbations, also known as natural corruptions, on the multi-view BEV detection task. We instead develop a black-box robustness evaluation framework that adversarially optimises three common semantic perturbations (geometric transformation, colour shifting, and motion blur) to deceive BEV models, making it the first approach of this kind in this emerging field. To address the challenge of optimising semantic perturbations, we design a smoothed, distance-based surrogate function to replace the mAP metric and introduce SimpleDIRECT, a deterministic optimisation algorithm that uses observed slopes to guide the optimisation process. Comparisons with randomised perturbations and two optimisation baselines demonstrate the effectiveness of the proposed framework. Additionally, we provide a benchmark of the semantic robustness of ten recent BEV models. The results reveal that PolarFormer, which emphasises geometric information from multi-view images, exhibits the highest robustness, whereas BEVDet is fully compromised, with its precision reduced to zero.
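The abstract does not spell out the surrogate function or SimpleDIRECT, so the following is only a minimal sketch of the general idea, written under stated assumptions. The function `smooth_distance_surrogate` is a hypothetical smooth score built from centre distances; it is not the paper's surrogate. The function `trisect_search` is a greedy, deterministic box-trisection search used as a stand-in; it does not use observed slopes and is not SimpleDIRECT. The toy detector, the parameter names, and the bounds are all invented for illustration.

```python
import math

def smooth_distance_surrogate(gt_centres, pred_centres, pred_scores, tau=1.0):
    """Hypothetical smooth detection score in [0, 1]; higher means better detection.

    Each ground-truth centre is credited with its best confidence-weighted
    soft match exp(-d / tau). Unlike thresholded matching, the value changes
    continuously as predictions drift away from the ground truth.
    """
    if not gt_centres:
        return 0.0
    total = 0.0
    for gx, gy in gt_centres:
        best = 0.0
        for (px, py), score in zip(pred_centres, pred_scores):
            d = math.hypot(px - gx, py - gy)
            best = max(best, score * math.exp(-d / tau))
        total += best
    return total / len(gt_centres)

def trisect_search(objective, bounds, budget=41):
    """Greedy deterministic minimiser (assumed stand-in, not SimpleDIRECT).

    Works in the unit cube, repeatedly trisecting the box with the lowest
    centre value along its longest side. Only queries the objective, so it
    fits a black-box setting. Being greedy, it can miss the global minimum.
    """
    def to_params(box):
        # Map the centre of a unit-cube box to the real parameter ranges.
        return [lo + 0.5 * (a + b) * (hi - lo)
                for (a, b), (lo, hi) in zip(box, bounds)]

    root = [(0.0, 1.0)] * len(bounds)
    boxes = [(objective(to_params(root)), root)]
    evals = 1
    while evals + 2 <= budget:
        boxes.sort(key=lambda item: item[0])
        value, box = boxes.pop(0)
        k = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
        a, b = box[k]
        third = (b - a) / 3.0
        for j in range(3):
            child = list(box)
            child[k] = (a + j * third, a + (j + 1) * third)
            if j == 1:
                boxes.append((value, child))  # middle child keeps the parent's centre
            else:
                boxes.append((objective(to_params(child)), child))
                evals += 1
    best_value, best_box = min(boxes, key=lambda item: item[0])
    return to_params(best_box), best_value

# Toy black box (purely illustrative): predictions drift and lose confidence
# as the rotation angle and brightness shift grow.
GT = [(5.0, 0.0), (-3.0, 8.0)]
BOUNDS = [(-10.0, 10.0), (-0.3, 0.3)]  # (rotation in degrees, brightness shift)

def toy_black_box(params):
    rot_deg, brightness = params
    drift = 0.1 * abs(rot_deg) + 2.0 * abs(brightness)
    preds = [(x + drift, y) for x, y in GT]
    scores = [1.0 / (1.0 + drift)] * len(GT)
    return smooth_distance_surrogate(GT, preds, scores)

if __name__ == "__main__":
    params, value = trisect_search(toy_black_box, BOUNDS)
    print("worst-case perturbation found:", params, "surrogate score:", value)
```

In this sketch the attacker minimises the surrogate score, so a lower returned value corresponds to a more damaging perturbation. A real evaluation would replace `toy_black_box` with a function that applies the perturbations to the multi-view images, runs the BEV model, and scores its detections.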