MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

📅 2025-09-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Weakly supervised 3D object detection suffers from projection ambiguity and single-view occlusion due to its reliance solely on 2D bounding box annotations. To address this, we propose a teacher–student distillation framework that fuses temporal and multi-view information. Our key contributions are: (1) robust static object representation via temporal point cloud aggregation; (2) a multi-view teacher network that generates high-quality 3D pseudo-labels, improving detection of both dynamic and static objects; and (3) a multi-view 2D projection consistency loss to enforce geometric constraints. The method operates entirely without 3D ground-truth annotations. Evaluated on nuScenes and the Waymo Open Dataset, it achieves state-of-the-art performance, significantly narrowing the gap with fully supervised methods—improving nuScenes mAP by 8.2% and the Waymo LEVEL_1 metric by 6.5%.
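The temporal aggregation step (contribution 1) can be pictured as accumulating an object's partial scans across frames into one dense cloud. Below is a minimal, hypothetical NumPy sketch of that idea; the function name, the assumption of known per-frame sensor-to-world poses, and the per-frame object point clusters are our own illustration, not the authors' code.

# Hypothetical sketch of temporal object-centric aggregation (not the paper's code).
import numpy as np

def aggregate_object_points(frames, poses):
    """Accumulate a static object's points across frames in a common world frame.

    frames: list of (N_i, 3) arrays, object points in each frame's sensor coordinates.
    poses:  list of (4, 4) sensor-to-world rigid transforms, one per frame.
    Returns a single dense (sum N_i, 3) point cloud in world coordinates.
    """
    world_points = []
    for pts, pose in zip(frames, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])  # (N_i, 4) homogeneous coords
        world_points.append((homo @ pose.T)[:, :3])      # apply the rigid transform
    return np.concatenate(world_points, axis=0)

For a static object, the union of these transformed scans approximates a far more complete shape than any single-frame view, which is what makes the aggregated targets usable for the Teacher.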

📝 Abstract
Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities, since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single-viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations that are as dense and complete as possible. A Teacher-Student distillation paradigm is employed: the Teacher network learns from single viewpoints, but its targets are derived from temporally aggregated static objects. The Teacher then generates high-quality pseudo-labels that the Student learns to predict from a single viewpoint, for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available at https://github.com/CEA-LIST/MVAT.
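The multi-view 2D projection loss is described only at a high level here, but a plausible reading is: project the predicted 3D box corners into every camera that carries a 2D annotation and penalize the gap between the enclosing rectangle and that annotation. The PyTorch sketch below illustrates this reading; project_corners, the L1 penalty, and the (K, T, box2d) view tuples are assumptions for illustration, not the paper's exact formulation.

# Hypothetical sketch of a multi-view 2D projection consistency loss.
import torch

def project_corners(corners_3d, K, T_world_to_cam):
    """Project (8, 3) world-frame box corners into one camera; returns (8, 2) pixels."""
    homo = torch.cat([corners_3d, torch.ones(8, 1)], dim=1)   # (8, 4) homogeneous
    cam = (homo @ T_world_to_cam.T)[:, :3]                    # world -> camera frame
    uv = cam @ K.T                                            # pinhole projection
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)             # perspective divide

def projection_consistency_loss(corners_3d, views):
    """Mean L1 gap between each annotated 2D box and the projected 3D box extent.

    views: iterable of (K, T_world_to_cam, box2d), box2d = tensor (x1, y1, x2, y2).
    """
    losses = []
    for K, T, box2d in views:
        uv = project_corners(corners_3d, K, T)
        pred = torch.cat([uv.min(dim=0).values, uv.max(dim=0).values])  # enclosing rect
        losses.append((pred - box2d).abs().mean())
    return torch.stack(losses).mean()

Averaging over all annotated views is what lets the constraint disambiguate 3D poses that would be indistinguishable from any single 2D box.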
Problem

Research questions and friction points this paper is trying to address.

Reducing 3D annotation costs using 2D box supervision
Resolving projection ambiguities from single 2D viewpoints
Addressing partial object visibility in 3D detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal multi-view data aggregation
Uses Teacher-Student distillation with pseudo-labels (see the sketch after this list)
Incorporates multi-view 2D projection consistency loss
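As a rough illustration of the distillation bullet above, here is a hypothetical PyTorch training step in which the Teacher's confident predictions become regression targets for the Student. The score threshold, the naive one-to-one matching, and both model interfaces are invented for the sketch; the paper's actual target assignment and losses will differ.

# Hypothetical sketch of one Teacher -> Student pseudo-labeling step.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, points, score_thresh=0.5):
    """One pseudo-label step: the Teacher labels a scene, the Student regresses it.

    Both models are assumed to map a point cloud to ((M, 7) boxes, (M,) scores).
    """
    teacher.eval()
    with torch.no_grad():
        boxes, scores = teacher(points)
        pseudo = boxes[scores > score_thresh]       # keep confident pseudo-labels only
    pred_boxes, _ = student(points)
    # Toy matching: assume the first len(pseudo) predictions correspond one-to-one.
    loss = F.smooth_l1_loss(pred_boxes[: len(pseudo)], pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()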
Saad Lahlali
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Alexandre Fournier Montgieux
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Nicolas Granger
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Hervé Le Borgne
CEA List, France
Quoc Cuong Pham
Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France