Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation that radar-camera bird’s-eye-view (BEV) perception suffers under cross-dataset evaluation, where driving scenes, sensor configurations, and environmental conditions differ between datasets. Existing approaches do not effectively model the feature variations that arise within the source domain. To tackle this, the paper introduces frequency-domain scene variation modeling into multi-modal 3D perception for the first time. The method synthesizes diverse source-domain views through frequency-domain transformations, analyzes how those views change the fused BEV features, and applies a spatial regularization mechanism integrated within the fusion process. This improves generalization without requiring any target-domain data. On cross-dataset 3D detection between View-of-Delft and TJ4DRadSet, the method yields consistent gains across multiple BEV backbones, and the gains persist when a limited amount of target-domain data is available.
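As a rough illustration of what "synthesizing source-domain views through frequency-domain transformations" can look like, the sketch below blends the low-frequency amplitude spectrum of one image with that of another while keeping the original phase. This is a commonly used Fourier-style perturbation and is only an assumption here: the summary does not specify the paper's actual transformation, and the function name `frequency_view` and the parameters `beta` and `alpha` are hypothetical.

```python
import numpy as np


def frequency_view(image, reference, beta=0.1, alpha=0.5):
    """Return a view of `image` whose low-frequency amplitude is blended
    with that of `reference`; the phase of `image` is kept unchanged.

    image, reference: float arrays of shape (H, W, C) with values in [0, 1].
    beta:  fraction of each spatial axis treated as low frequency (hypothetical knob).
    alpha: blend weight; alpha=0 reproduces the original image.
    """
    # 2D FFT over the spatial axes, applied per channel.
    f_img = np.fft.fft2(image, axes=(0, 1))
    f_ref = np.fft.fft2(reference, axes=(0, 1))
    phase = np.angle(f_img)

    # Shift so that the zero-frequency term sits at the centre of the spectrum.
    amp_img = np.fft.fftshift(np.abs(f_img), axes=(0, 1))
    amp_ref = np.fft.fftshift(np.abs(f_ref), axes=(0, 1))

    # Centred window covering the low-frequency band.
    h, w = image.shape[:2]
    bh, bw = int(h * beta / 2), int(w * beta / 2)
    ch, cw = h // 2, w // 2
    rows = slice(ch - bh, ch + bh + 1)
    cols = slice(cw - bw, cw + bw + 1)

    # Blend amplitudes inside the window only; high frequencies are untouched.
    mixed = amp_img.copy()
    mixed[rows, cols] = (1 - alpha) * amp_img[rows, cols] + alpha * amp_ref[rows, cols]
    mixed = np.fft.ifftshift(mixed, axes=(0, 1))

    # Recombine the blended amplitude with the original phase and invert.
    out = np.fft.ifft2(mixed * np.exp(1j * phase), axes=(0, 1))
    return np.clip(out.real, 0.0, 1.0)
```

Keeping the phase preserves object layout, so 3D box labels remain valid for the synthesized view, while the amplitude change alters global appearance such as brightness and color cast.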

📝 Abstract

Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies the issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements across multiple BEV fusion backbones, and the gains persist when a small amount of target-domain data is available.
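The comparison of fused BEV representations described above can be pictured with a minimal sketch, assuming the detector exposes a fused BEV feature map of shape (C, H, W) for both the original and the synthesized view. The per-cell squared difference and its mean used as a training penalty are illustrative choices, not the paper's stated regularizer, and `bev_variation` is a hypothetical name. NumPy is used only to show the arithmetic; a real training loop would compute this in an autograd framework.

```python
import numpy as np


def bev_variation(bev_orig, bev_aug):
    """Compare two fused BEV feature maps of shape (C, H, W).

    Returns a (H, W) map of per-cell feature variation and a scalar
    penalty (its mean) that could be added to the detection loss
    during training only.
    """
    diff = bev_aug - bev_orig
    # Mean squared change over channels: how much each BEV cell moved.
    variation_map = np.mean(diff ** 2, axis=0)
    penalty = float(variation_map.mean())
    return variation_map, penalty


# Usage with stand-in features (assumed shapes, random values).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((8, 4, 4))
feat_b = feat_a + 0.1 * rng.standard_normal((8, 4, 4))
vmap, reg = bev_variation(feat_a, feat_b)
```

Because the penalty depends only on features computed during training, dropping it at test time leaves the inference pipeline unchanged, which matches the training-only design stated in the abstract.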

Problem

Research questions and friction points this paper is trying to address.

cross-dataset generalization

radar-camera fusion

BEV perception

feature variation

3D object detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-sensor generalization

BEV perception

frequency-domain variation modeling