CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degradation and spatial misalignment of LiDAR and camera modalities in drone-based top-down scenes caused by tree canopy occlusion. To tackle this challenge, the authors propose an occlusion-aware multimodal fusion framework that explicitly models bimodal occlusion intensity for the first time in UAV 3D object detection. The approach employs a physically inspired closed-form model grounded in the Beer–Lambert law, integrated with building mask correction and an occlusion prediction network, to embed occlusion priors throughout the entire pipeline: data augmentation, feature encoding, fusion, and the detection head. Evaluated on the newly introduced UAV datasets SI3D-DI and SI3D-DII, the method improves hard-level BEV mAP by 9.43% and 4.88%, respectively, over the best competing methods.

📝 Abstract

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
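
As a rough illustration of what a Beer-Lambert-inspired occlusion intensity could look like, the sketch below maps the distance a ray travels through canopy to a value between 0 and 1. It is not the paper's formulation: the function name `beer_lambert_occlusion`, the extinction coefficient value, and the use of canopy path length as the only input are all assumptions made for this example, and building-mask correction is not modeled.

```python
import numpy as np


def beer_lambert_occlusion(path_length_m, extinction_coeff=0.5):
    """Hypothetical helper: occlusion intensity from canopy path length.

    Beer-Lambert attenuation gives transmittance T = exp(-k * d), where d is
    the distance travelled through the attenuating medium and k is an
    extinction coefficient. Occlusion intensity is taken here as 1 - T.
    The default k = 0.5 is an arbitrary illustrative value.
    """
    # Negative path lengths are physically meaningless, so clamp them to 0.
    path = np.clip(np.asarray(path_length_m, dtype=float), 0.0, None)
    transmittance = np.exp(-extinction_coeff * path)
    return 1.0 - transmittance


# Example: rays passing through 0 m, 1 m, and 4 m of canopy.
canopy_depth = np.array([0.0, 1.0, 4.0])
occlusion = beer_lambert_occlusion(canopy_depth)
```

Under this assumed form, an unobstructed ray has occlusion 0, and the value rises monotonically toward 1 as the canopy path lengthens. A per-modality version would presumably use different coefficients or inputs for LiDAR and camera, which the abstract does not specify.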

Problem

Research questions and friction points this paper is trying to address.

occlusion

multimodal fusion

3D object detection

UAV

LiDAR-camera

Innovation

Methods, ideas, or system contributions that make the work stand out.

occlusion-aware

multimodal fusion

UAV 3D detection