🤖 AI Summary
Existing real-world driving datasets lack pixel-wise aligned images of the same scene under varying illumination and weather conditions, making it difficult to disentangle the impact of photometric changes on visual perception performance. To address this limitation, this work leverages the high-fidelity game engine GTA V and utilizes its API to programmatically generate strictly paired image sequences across multiple environmental conditions, while precisely preserving scene geometry, camera pose, and the identity and location of dynamic objects. This approach achieves, for the first time, a complete decoupling of photometric variation from geometric and semantic factors in driving scenarios. The resulting dataset establishes a controlled and fair benchmark, which is successfully employed to quantify performance degradation in semantic segmentation models and unambiguously attribute such degradation to photometric shifts.
📝 Abstract
Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.