🤖 AI Summary
Supervised stereo matching methods suffer from poor generalization in real-world scenarios due to the scarcity of annotated stereo image pairs.
Method: This paper proposes StereoGen, a pipeline that synthesizes stereo pairs from single images and requires no ground-truth stereo pairs for training. It takes an arbitrary image as the left view, estimates monocular depth to obtain a pseudo-disparity, warps the left view into the right view using that pseudo-disparity, and fills the occluded regions with a fine-tuned diffusion inpainting model. It also introduces Training-free Confidence Generation, which suppresses the effect of harmful pseudo ground truth during stereo training, and Adaptive Disparity Selection, which widens the disparity distribution and improves the synthesized images.
Results: Stereo matching models trained on the synthesized pairs achieve state-of-the-art zero-shot generalization among published methods. The synthesized pairs show rich texture detail and intact semantic structure, which improves generalization to real-world scenes without domain-specific fine-tuning.
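The first step in the summary, turning a monocular depth estimate into pseudo-disparity, can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes the depth model outputs relative inverse depth (larger means closer), and the function name `depth_to_pseudo_disparity` and the `max_disp` parameter are hypothetical. How the paper actually chooses the disparity range (its Adaptive Disparity Selection) is not described here, so `max_disp` is simply passed in by the caller.

```python
import numpy as np

def depth_to_pseudo_disparity(inv_depth, max_disp):
    """Rescale a relative inverse-depth map to pixel disparities in [0, max_disp].

    Assumption: inv_depth is relative inverse depth, so larger values are
    closer to the camera and should receive larger disparities.
    """
    inv_depth = np.asarray(inv_depth, dtype=np.float64)
    lo, hi = inv_depth.min(), inv_depth.max()
    if hi - lo < 1e-12:
        # A constant map carries no relative depth information.
        return np.zeros_like(inv_depth)
    # Min-max normalise to [0, 1], then scale to the chosen disparity range.
    return (inv_depth - lo) / (hi - lo) * max_disp
```

Sampling a different `max_disp` per image is one simple way to vary the disparity distribution of the generated training set, though it should not be read as the paper's strategy.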
📝 Abstract
State-of-the-art supervised stereo matching methods have achieved impressive results on various benchmarks. However, these data-driven methods generalize poorly to real-world scenarios because annotated real-world data is scarce. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. The pipeline takes arbitrary single images as left images and uses pseudo disparities generated by a monocular depth estimation model to synthesize the corresponding right images. Unlike previous methods, which fill the occluded areas of warped right images with random backgrounds or use convolutions to selectively borrow nearby pixels, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model have better details and intact semantic structures. In addition, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained with our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.
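The warping step that produces the right image, and the occluded area that the inpainting model must then fill, can be illustrated with a small forward-warping sketch. It is written under stated assumptions rather than taken from the paper: the pair is rectified, a left pixel at column x lands at column x - d in the right view, and where several pixels land on the same target the one with the larger disparity (the closer one) wins. The function name `forward_warp_right` is hypothetical, and the diffusion inpainting itself is not shown.

```python
import numpy as np

def forward_warp_right(left, disparity):
    """Forward-warp a left image into the right view.

    left:      (H, W, C) image array.
    disparity: (H, W) non-negative disparity in pixels.
    Returns (right, hole_mask); hole_mask is True where no left pixel
    landed, i.e. the region an inpainting model would need to fill.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    # Largest disparity written so far at each target pixel.
    zbuf = np.full((h, w), -np.inf)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xr = int(round(x - d))  # target column in the right view
            # Keep the closest surface (largest disparity) on collisions.
            if 0 <= xr < w and d > zbuf[y, xr]:
                zbuf[y, xr] = d
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    return right, ~filled
```

The explicit loops keep the occlusion logic readable. A practical version would vectorise the scatter and handle sub-pixel disparities, which this sketch rounds to the nearest column.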