MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model

📅 2025-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the large depth estimation errors and low efficiency of multi-stage pipelines in monocular 3D object detection, this paper proposes an end-to-end, single-stage detection framework based on Vision Transformers (ViT). Methodologically, it (1) jointly models a ViT backbone with a DETR decoder; (2) introduces a hierarchical feature fusion module to enhance multi-scale representation; (3) incorporates a transferable relative depth estimation module to improve depth accuracy; and (4) explicitly encodes 2D bounding box reference points and dimensions into the DETR queries to strengthen geometric constraints. Evaluated on the KITTI 3D benchmark and a newly constructed high-elevation racing dataset, the method significantly outperforms recent state-of-the-art approaches, improving 3D detection average precision (AP) and reducing depth estimation error. The source code is publicly available.

📝 Abstract
This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer's decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at https://github.com/JihyeokKim/MonoDINO-DETR.
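The abstract mentions that decoder queries incorporate the reference points and dimensions of 2D bounding boxes to strengthen geometric constraints. The paper's implementation details are in the linked repository; the following is only a minimal illustrative sketch, assuming DETR-style sinusoidal positional encodings of normalized box coordinates (function names and dimensions are hypothetical, not taken from the paper).

```python
import numpy as np

def sinusoidal_embed(x, dim=64, temperature=10000.0):
    """Map a scalar in [0, 1] to a dim-dimensional sinusoidal embedding,
    as commonly done for positional encodings in DETR-style decoders."""
    freqs = temperature ** (-np.arange(dim // 2) / (dim // 2))
    angles = 2 * np.pi * x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def bbox_to_query(box, dim_per_coord=64):
    """Encode a normalized 2D box (cx, cy, w, h) into one decoder query.

    The reference point (cx, cy) and the box dimensions (w, h) are each
    embedded and concatenated, so the resulting query carries explicit
    2D geometric context into the transformer decoder.
    """
    cx, cy, w, h = box
    return np.concatenate(
        [sinusoidal_embed(v, dim_per_coord) for v in (cx, cy, w, h)]
    )

# One query per candidate 2D box, e.g. a box centered at (0.5, 0.4)
# with width 0.2 and height 0.3 of the image.
query = bbox_to_query((0.5, 0.4, 0.2, 0.3))
```

In this sketch each coordinate contributes a 64-dimensional embedding, giving a 256-dimensional query; an actual implementation would typically project this to the decoder's hidden size and add learned content embeddings.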
Problem

Research questions and friction points this paper is trying to address.

3D Environment
Object Recognition
Distance Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MonoDINO-DETR
Vision Transformer (ViT)
3D Object Detection
Jihyeok Kim
Korea Advanced Institute of Science and Technology
Robotics · Autonomous Driving
Seongwoo Moon
Ph.D. Candidate
Field Robotics · Learning-based Robotics
Sun-Young Nah
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
D. Hyunchul Shim
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea