CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the dual challenges of limited compute and scarce ground-truth depth annotations in edge scenarios such as autonomous driving and robotics, this paper proposes CCNeXt, an efficient CNN architecture for self-supervised stereo depth estimation. CCNeXt introduces a windowed epipolar cross-attention module in the encoder to explicitly model cross-view feature correlations under epipolar geometry constraints, and redesigns the depth decoder to improve the accuracy–speed trade-off. It is trained end-to-end solely on stereo image pairs, without any ground-truth depth supervision. On the KITTI Eigen split, it reports competitive metrics while running 10.18× faster than the current best model; on both the KITTI Eigen split with improved ground truth and the Driving Stereo benchmark, it attains state-of-the-art results, including absolute relative error (AbsRel↓) and thresholded accuracy (δ<1.25↑). These results support the practical deployment of self-supervised depth estimation in resource-constrained edge environments.
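The two metrics named in the summary, AbsRel and δ<1.25, follow standard definitions in the depth estimation literature. The snippet below is a minimal sketch of those definitions on flat lists of depth values; the function names and sample numbers are illustrative assumptions, not code or results from the CCNeXt repository.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    # Pixels with gt <= 0 carry no ground truth and are skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Made-up depths in metres; the last pixel has no ground truth.
pred = [10.0, 20.0, 33.0, 5.0]
gt = [10.0, 40.0, 30.0, 0.0]
print("AbsRel:", abs_rel(pred, gt))
print("delta < 1.25:", delta_accuracy(pred, gt))
```

Lower AbsRel is better and higher δ<1.25 is better, which is what the arrows in the summary indicate.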

📝 Abstract

Depth estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation, since only the disparity of pixels between the two images needs to be estimated to determine depth in a known rectified system. Because reliable ground-truth depth data is difficult to acquire across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18× faster than the current best model, and achieves state-of-the-art results in all metrics on the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at https://github.com/alelopes/CCNext.
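The abstract's point that disparity determines depth in a known rectified system rests on the standard pinhole stereo relation Z = f · B / d, where f is the focal length in pixels, B the camera baseline, and d the disparity in pixels. A minimal sketch follows; the calibration constants are hypothetical placeholders chosen for illustration and are not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres from disparity in pixels for a rectified pair: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px


# Hypothetical calibration values (assumptions for this example only).
FOCAL_PX = 720.0
BASELINE_M = 0.54

for d in (72.0, 36.0, 7.2):
    z = disparity_to_depth(d, FOCAL_PX, BASELINE_M)
    print(f"disparity {d:5.1f} px -> depth {z:.2f} m")
```

Because depth is inversely proportional to disparity, a fixed disparity error of one pixel costs far more depth accuracy for distant points than for near ones.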

Problem

Research questions and friction points this paper is trying to address.

Self-supervised stereo depth estimation under computational constraints

Overcoming difficulty in acquiring reliable ground-truth depth data

Balancing accuracy and computational efficiency in depth estimation
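Self-supervised stereo methods commonly replace ground-truth depth with a photometric reconstruction signal: the right image is warped into the left view using the predicted disparity, and the difference from the real left image is penalized. The sketch below shows that idea on a single scanline with an L1 penalty. It is a generic illustration, not the loss used by CCNeXt, which this summary does not specify; practical losses often add structural-similarity and smoothness terms. All names are hypothetical.

```python
def warp_right_to_left(right_row, disparity_row):
    """Reconstruct a left scanline by sampling the right scanline at x - d(x)."""
    width = len(right_row)
    out = []
    for x, d in enumerate(disparity_row):
        src = min(max(x - d, 0.0), width - 1.0)  # clamp to the image border
        x0 = int(src)
        x1 = min(x0 + 1, width - 1)
        w = src - x0
        # Linear interpolation between the two nearest right-image pixels.
        out.append((1.0 - w) * right_row[x0] + w * right_row[x1])
    return out


def l1_photometric_loss(left_row, right_row, disparity_row):
    """Mean absolute difference between the left scanline and its reconstruction."""
    recon = warp_right_to_left(right_row, disparity_row)
    return sum(abs(a - b) for a, b in zip(left_row, recon)) / len(left_row)


right = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
left = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]  # the right scanline shifted by 2 pixels
print("loss with correct disparity:", l1_photometric_loss(left, right, [2.0] * 6))
print("loss with zero disparity:   ", l1_photometric_loss(left, right, [0.0] * 6))
```

The correct disparity reproduces the left scanline exactly, so the loss drops to zero, which is the signal a network can be trained on without depth labels.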

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised convolutional approach for stereo depth

Modern CNN feature extractor with epipolar attention

Redesigned decoder balancing speed and accuracy
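One plausible reading of "windowed epipolar cross-attention" is the following: in a rectified stereo pair, epipolar lines coincide with image rows, so each left-image feature only needs to attend to right-image features on the same row, and a window limits how many columns are considered. The sketch below implements that reading for a single row in plain Python. It is an assumption-laden illustration, not the module from the paper: the symmetric window, the `radius` parameter, and the absence of learned query/key/value projections are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_row_cross_attention(left_row, right_row, radius):
    """For each left feature, attend to right features on the same row
    whose column lies within `radius` of the query column."""
    width, dim = len(left_row), len(left_row[0])
    out = []
    for x, query in enumerate(left_row):
        lo, hi = max(0, x - radius), min(width, x + radius + 1)
        keys = right_row[lo:hi]  # keys double as values in this sketch
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * key[c] for w, key in zip(weights, keys)) for c in range(dim)]
        )
    return out


# One row of 4 positions with 2-dimensional features (made-up values).
left_row = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
right_row = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
for vec in windowed_row_cross_attention(left_row, right_row, radius=1):
    print([round(v, 3) for v in vec])
```

Restricting attention to a row window keeps the cost per pixel proportional to the window size rather than to the full image, which is consistent with the efficiency goal stated above.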

🔎 Similar Papers

Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes