Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised monocular depth and ego-motion estimation methods underutilize geometric constraints, leading to degraded performance under challenging conditions such as illumination changes and rapid motion. Method: we propose motion component decoupling, explicitly separating translation and rotation estimation and enforcing closed-form geometric constraints: a coaxial (depth-translation) constraint via camera optical-axis alignment and a coplanar (rotation-optical flow) constraint via image-plane projection. We introduce differentiable optical flow warping and geometric alignment modules to achieve explicit source-target frame alignment, with alignment residuals quantifying per-component motion errors. Depth and translation are made mutually inferable, strengthening complementary constraints. Results: our method achieves state-of-the-art performance on KITTI, Make3D, and a newly collected real-world dataset, demonstrating significantly improved robustness under complex scenarios, including varying illumination and fast ego-motion, while maintaining computational efficiency and end-to-end differentiability.
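The differentiable flow warping mentioned in the summary builds on the standard rigid-flow formulation used throughout unsupervised depth and ego-motion pipelines. The sketch below is illustrative only, not the paper's implementation: all names and shapes are assumptions, and it computes the optical flow induced by a relative camera pose (R, t) given per-pixel depth.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Rigid optical flow induced by camera motion (R, t) given depth.

    Illustrative sketch of the standard formulation (not DiMoDE's code).
    depth: (H, W) target-frame depth map
    K:     (3, 3) camera intrinsics
    R:     (3, 3) rotation, t: (3,) translation (target -> source)
    Returns a (H, W, 2) flow field in pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Back-project pixels into 3D camera space using depth
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Apply the rigid motion and reproject into the source view
    cam_src = R @ cam + t.reshape(3, 1)
    proj = K @ cam_src
    proj = proj[:2] / proj[2:3]  # perspective divide
    # Flow is the displacement between reprojected and original pixels
    return (proj - pix[:2]).T.reshape(H, W, 2)
```

Under an identity pose (R = I, t = 0) the reprojection reproduces the original pixel grid, so the flow is zero everywhere, which is a convenient sanity check for any such warping module.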

📝 Abstract
Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.
Problem

Research questions and friction points this paper is trying to address.

Discriminatively treats motion components for joint depth and ego-motion learning
Imposes geometric constraints individually on each ego-motion component
Reformulates joint learning using coaxial and coplanar geometric relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discriminative treatment of motion components
Aligns optical axes and imaging planes
Reformulates learning into coaxial coplanar forms
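The abstract's observation that rotational motion is depth-independent while translational flow couples to depth follows from the textbook decomposition of rigid flow. The minimal sketch below illustrates that decomposition under assumed symbols; it is not DiMoDE's implementation.

```python
import numpy as np

def split_flow(depth, K, R, t):
    """Split rigid flow into a depth-independent rotational part and a
    depth-dependent translational part (illustrative decomposition only).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix  # normalized viewing rays
    # Rotation-only reprojection: the infinite-depth limit, no depth used
    p_rot = K @ (R @ rays)
    p_rot = p_rot[:2] / p_rot[2:3]
    flow_rot = (p_rot - pix[:2]).T.reshape(H, W, 2)
    # Full rigid reprojection including translation, which needs depth
    cam = rays * depth.reshape(1, -1)
    p_full = K @ (R @ cam + t.reshape(3, 1))
    p_full = p_full[:2] / p_full[2:3]
    flow_full = (p_full - pix[:2]).T.reshape(H, W, 2)
    # Residual after removing rotation is the translational component
    return flow_rot, flow_full - flow_rot
```

With t = 0 the translational component vanishes regardless of the depth map, which is exactly why supervision that lumps both components together cannot constrain depth through rotational motion.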
Mengtan Zhang
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai 201210, China
Zizhan Guo
College of Electronic & Information Engineering, Tongji University, Shanghai 201804, China
Hongbo Zhao
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai 201210, China
Yi Feng
College of Electronic & Information Engineering, Tongji University, Shanghai 201804, China
Zuyi Xiong
College of Electronic & Information Engineering, Tongji University, Shanghai 201804, China
Yue Wang
Department of Control Science and Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, China
Shaoyi Du
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University
Pattern Recognition, Computer Vision, Image Processing
Hanli Wang
Tongji University
Multimedia Computing, Computer Vision, Image Processing, Machine Learning
Rui Fan
College of Electronic & Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Autonomous Intelligent Unmanned Systems, the Frontiers Science Center for Intelligent Autonomous Systems (Ministry of Education), and Shanghai Key Laboratory of Intelligent Autonomous Systems, Tongji University, Shanghai 201804, China, as well as with the National Key Laboratory of Human-Machine Hybrid Augmented