MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

📅 2025-09-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Weakly supervised 3D object detection suffers from projection ambiguity and single-view occlusion due to its reliance solely on 2D bounding box annotations. To address this, we propose a teacher–student distillation framework that fuses temporal and multi-view information. Our key contributions are: (1) robust static object representation via temporal point cloud aggregation; (2) a multi-view teacher network that generates high-quality 3D pseudo-labels, improving detection of both dynamic and static objects; and (3) a multi-view 2D projection consistency loss to enforce geometric constraints. The method operates entirely without 3D ground-truth annotations. Evaluated on nuScenes and the Waymo Open Dataset, it achieves state-of-the-art performance, significantly narrowing the gap with fully supervised methods—improving nuScenes mAP by 8.2% and the Waymo LEVEL_1 metric by 6.5%.
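The temporal aggregation step (contribution 1) can be pictured as accumulating an object's partial scans across frames into one dense cloud. Below is a minimal, hypothetical NumPy sketch of that idea; the function name, the assumption of known per-frame sensor-to-world poses, and the per-frame object point clusters are our own illustration, not the authors' code.

# Hypothetical sketch of temporal object-centric aggregation (not the paper's code).
import numpy as np

def aggregate_object_points(frames, poses):
    """Accumulate a static object's points across frames in a common world frame.

    frames: list of (N_i, 3) arrays, object points in each frame's sensor coordinates.
    poses:  list of (4, 4) sensor-to-world rigid transforms, one per frame.
    Returns a single dense (sum N_i, 3) point cloud in world coordinates.
    """
    world_points = []
    for pts, pose in zip(frames, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # (N_i, 4) homogeneous coords
        world_points.append((homo @ pose.T)[:, :3])      # apply the rigid transform
    return np.concatenate(world_points, axis=0)

For a static object, the union of these transformed scans approximates a far more complete shape than any single-frame view, which is what makes the aggregated targets usable for the Teacher.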

📝 Abstract
Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities, since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single-viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations that are as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint, for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available at https://github.com/CEA-LIST/MVAT.
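The multi-view 2D projection loss is described only at a high level here, but a plausible reading is: project the predicted 3D box corners into every camera that carries a 2D annotation and penalize the gap between the enclosing rectangle and that annotation. The PyTorch sketch below illustrates this reading; project_corners, the L1 penalty, and the (K, T, box2d) view tuples are assumptions for illustration, not the paper's exact formulation.

# Hypothetical sketch of a multi-view 2D projection consistency loss.
import torch

def project_corners(corners_3d, K, T_world_to_cam):
    """Project (8, 3) world-frame box corners into one camera; returns (8, 2) pixels."""
    homo = torch.cat([corners_3d, torch.ones(8, 1)], dim=1)   # (8, 4) homogeneous
    cam = (homo @ T_world_to_cam.T)[:, :3]                    # world -> camera frame
    uv = cam @ K.T                                            # pinhole projection
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)             # perspective divide

def projection_consistency_loss(corners_3d, views):
    """Mean L1 gap between each annotated 2D box and the projected 3D box extent.

    views: iterable of (K, T_world_to_cam, box2d), box2d = tensor (x1, y1, x2, y2).
    """
    losses = []
    for K, T, box2d in views:
        uv = project_corners(corners_3d, K, T)
        pred = torch.cat([uv.min(dim=0).values, uv.max(dim=0).values])  # enclosing rect
        losses.append((pred - box2d).abs().mean())
    return torch.stack(losses).mean()

Averaging over all annotated views is what lets the constraint disambiguate 3D poses that would be indistinguishable from any single 2D box.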
Problem

Research questions and friction points this paper is trying to address.

Reducing 3D annotation costs using 2D box supervision
Resolving projection ambiguities from single 2D viewpoints
Addressing partial object visibility in 3D detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal multi-view data aggregation
Uses Teacher-Student distillation with pseudo-labels (see the sketch after this list)
Incorporates multi-view 2D projection consistency loss
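As a rough illustration of the distillation bullet above, here is a hypothetical PyTorch training step in which the Teacher's confident predictions become regression targets for the Student. The score threshold, the naive one-to-one matching, and both model interfaces are invented for the sketch; the paper's actual target assignment and losses will differ.

# Hypothetical sketch of one Teacher -> Student pseudo-labeling step.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, points, score_thresh=0.5):
    """One pseudo-label step: the Teacher labels a scene, the Student regresses it.

    Both models are assumed to map a point cloud to ((M, 7) boxes, (M,) scores).
    """
    teacher.eval()
    with torch.no_grad():
        boxes, scores = teacher(points)
        pseudo = boxes[scores > score_thresh]       # keep confident pseudo-labels only
    pred_boxes, _ = student(points)
    # Toy matching: assume the first len(pseudo) predictions correspond one-to-one.
    loss = F.smooth_l1_loss(pred_boxes[: len(pseudo)], pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()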
Saad Lahlali
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Alexandre Fournier Montgieux
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Nicolas Granger
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Hervé Le Borgne
CEA List, France
Quoc Cuong Pham
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France