🤖 AI Summary
This study addresses the performance limitations of 3D human detection in diverse indoor and outdoor environments under challenges such as occlusion, varying distances, and sensor anomalies. It presents a systematic evaluation of three representative approaches on the JRDB dataset: camera-only (BEVDepth), LiDAR-only (PointPillars), and multimodal fusion (DAL). As the first comprehensive comparison of multimodal 3D human detection outside autonomous driving scenarios, this work provides an in-depth analysis of how sensor failures and calibration errors affect the robustness of fusion systems. Experimental results demonstrate that DAL consistently outperforms the single-modality methods, particularly in challenging scenes, but remains sensitive to sensor misalignment and certain LiDAR corruptions, whereas the camera-only approach performs worst and degrades most under occlusion, at long range, and under noise.
📝 Abstract
Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model shows the lowest performance and is the most affected by occlusion, distance, and noise. Our findings highlight the importance of sensor fusion for 3D person detection, while also underscoring the need for further research to address the vulnerabilities that remain in these systems.