🤖 AI Summary
Commercial RGB-D cameras yield highly noisy and severely incomplete depth maps for non-Lambertian objects, while existing depth completion methods suffer from poor generalization. To address this, we propose a diffusion-based depth completion framework. Our key contributions are: (1) a zero-terminal signal-to-noise-ratio rescaled noise scheduler that eliminates signal leakage bias; (2) a noise-agnostic single-step training strategy to mitigate exposure bias; and (3) a semantic-enhanced multi-task architecture jointly optimizing depth completion and semantic segmentation. The method leverages visual priors from pre-trained text-to-image diffusion models and incorporates task-specific loss functions. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with significant improvements in real-world generalization and downstream task performance—including category-level pose estimation and robotic grasping.
📝 Abstract
Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose **DidSee**, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping.

Project page: https://wenzhoulyu.github.io/DidSee/
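The abstract's first contribution, a noise scheduler rescaled to enforce a zero terminal signal-to-noise ratio, is commonly implemented by shifting and scaling the cumulative signal coefficients so that the final timestep carries no signal. The sketch below illustrates this standard rescaling in NumPy; the exact schedule DidSee uses is an assumption here, and the `scaled_linear` beta range is the one popularized by Stable Diffusion, not a value taken from the paper.

```python
import numpy as np

def rescale_zero_terminal_snr(betas: np.ndarray) -> np.ndarray:
    """Rescale a beta schedule so the terminal SNR is exactly zero.

    SNR(t) = alpha_bar(t) / (1 - alpha_bar(t)); forcing alpha_bar(T) = 0
    removes the residual signal "leak" at the last timestep.
    """
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)
    sqrt_ab = np.sqrt(alphas_bar)

    # Shift so the last timestep has zero signal, then scale so the
    # first timestep keeps its original value.
    first, last = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = (sqrt_ab - last) * first / (first - last)

    # Convert the rescaled cumulative products back to per-step betas.
    alphas_bar = sqrt_ab ** 2
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# Example: a 1000-step "scaled_linear" schedule (illustrative values).
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
new_betas = rescale_zero_terminal_snr(betas)
```

After rescaling, the final beta becomes 1.0 (all signal destroyed at the terminal step), which matches the pure-noise input seen at inference and removes the signal leakage bias the abstract describes.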