🤖 AI Summary
Addressing the challenge of reconstructing high-fidelity 3D ground truth from multi-camera 2D annotations (bounding boxes or keypoints), this paper proposes a fully automatic 3D reconstruction framework. Leveraging multi-view geometry and nonlinear filtering, it integrates unscented Kalman filtering (UKF) for state estimation, homography-based projection, single-object tracking, and cross-camera fusion of 2D observations, without requiring depth sensors or a planar-ground assumption. The method robustly estimates each object's 3D position, orientation, and complete shape. To our knowledge, it is the first end-to-end approach to generate multi-camera 3D ground truth solely from 2D annotations, effectively handling occlusions and supporting arbitrary camera topologies. Evaluated on the CMC, Wildtrack, and Panoptic datasets, it achieves significantly higher 3D localization accuracy than state-of-the-art methods. This enables scalable, high-fidelity 3D supervision for applications such as autonomous driving and intelligent surveillance.
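The cross-camera fusion idea can be illustrated with a minimal unscented-transform measurement update: sigma points drawn from the current state belief are pushed through a camera's nonlinear projection, and the resulting pixel-space statistics drive a Kalman-style correction. Everything below is a sketch under invented assumptions (a 2D ground-plane state, a hypothetical homography `H_cam`, illustrative noise levels), not the paper's actual models:

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Generate 2n+1 Julier sigma points and weights for an n-dim Gaussian."""
    n = mean.size
    S = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean] + [mean + S[:, i] for i in range(n)] + [mean - S[:, i] for i in range(n)]
    w = np.array([kappa / (n + kappa)] + [0.5 / (n + kappa)] * (2 * n))
    return np.array(pts), w

def ukf_update(mean, cov, z, h, R, kappa=1.0):
    """One UKF measurement update with a nonlinear measurement function h."""
    pts, w = sigma_points(mean, cov, kappa)
    Z = np.array([h(p) for p in pts])        # project sigma points into the image
    z_pred = w @ Z                           # predicted pixel measurement
    dZ = Z - z_pred
    dX = pts - mean
    S = dZ.T @ (w[:, None] * dZ) + R         # innovation covariance
    C = dX.T @ (w[:, None] * dZ)             # state-measurement cross covariance
    K = C @ np.linalg.inv(S)                 # Kalman gain
    return mean + K @ (z - z_pred), cov - K @ S @ K.T

# Hypothetical ground-plane-to-pixel homography for one camera (illustrative values).
H_cam = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0,   0.0,   1.0]])

def h(x):
    """Project a ground-plane point (X, Y) to pixel coordinates."""
    p = H_cam @ np.array([x[0], x[1], 1.0])
    return p[:2] / p[2]

# Fuse one noisy pixel observation of an object truly located at (1, 2) meters.
mean, cov = np.zeros(2), np.eye(2) * 4.0     # vague prior on the ground plane
z = h(np.array([1.0, 2.0])) + 0.5            # observed pixel, ~0.5 px off
mean, cov = ukf_update(mean, cov, z, h, np.eye(2))
```

In a multi-camera setting, one such update would be run per view, each with its own homography and noise model, which is how independent 2D annotations from different cameras tighten a single 3D estimate.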
📝 Abstract
Accurate 3D ground truth estimation is critical for applications such as autonomous navigation, surveillance, and robotics. This paper introduces a novel method that uses an Unscented Kalman Filter (UKF) to fuse 2D bounding box or pose keypoint ground truth annotations from multiple calibrated cameras into accurate 3D ground truth. Leveraging human-annotated 2D ground truth, the proposed multi-camera single-object tracking algorithm transforms 2D image coordinates into robust 3D world coordinates through homography-based projection and UKF-based fusion. The algorithm processes multi-view data to estimate object positions and shapes while effectively handling challenges such as occlusion. We evaluate our method on the CMC, Wildtrack, and Panoptic datasets, demonstrating high accuracy in 3D localization compared to the available 3D ground truth. Unlike existing approaches that provide only ground-plane information, our method also outputs the full 3D shape of each object. Additionally, the algorithm offers a scalable and fully automatic solution for multi-camera systems using only 2D image annotations.
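The homography-based projection mentioned above can be sketched as mapping a bounding-box bottom-center pixel onto the ground plane via the inverse homography. The matrix `H` and the round-trip helpers below are hypothetical, chosen only to make the geometry concrete:

```python
import numpy as np

# Hypothetical homography mapping ground-plane coordinates (X, Y, 1)
# to homogeneous pixel coordinates for one calibrated camera.
H = np.array([[400.0,  20.0, 640.0],
              [ 10.0, 450.0, 360.0],
              [ 0.001, 0.002,  1.0]])

def ground_to_pixel(X, Y):
    """Forward projection of a ground-plane point into the image."""
    p = H @ np.array([X, Y, 1.0])
    return p[:2] / p[2]

def foot_to_ground(u, v):
    """Back-project a bounding-box bottom-center pixel (u, v) onto the
    ground plane by solving H w = (u, v, 1) and dehomogenizing."""
    w = np.linalg.solve(H, np.array([u, v, 1.0]))
    return w[:2] / w[2]

# Round trip: a ground point at (2, 3) meters projects into the image
# and back-projects to the same ground-plane location.
u, v = ground_to_pixel(2.0, 3.0)
recovered = foot_to_ground(u, v)
```

Per-camera back-projections like this give coarse ground-plane candidates; the UKF-based fusion then reconciles them across views into a single consistent 3D estimate.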