Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Stereo matching models suffer from poor cross-domain robustness and struggle with domain shifts and imbalanced disparity distributions. To address these challenges, this paper proposes a lightweight adaptive framework integrating Vision Foundation Models (VFMs) with Mixture-of-Experts (MoE). The method introduces (1) MoE-LoRA and MoE-Adapter modules, incorporating variable-rank low-rank adaptation and dynamic-kernel-size convolutions; (2) a lightweight decision network that selectively activates experts based on input complexity, enabling efficient inference; and (3) zero-shot adaptability, achieving state-of-the-art performance across multiple cross-domain and joint generalization benchmarks without dataset-specific fine-tuning. The framework jointly optimizes robustness, accuracy, and computational efficiency, significantly enhancing practicality in real-world scenarios. Experimental results demonstrate consistent improvements over prior methods under domain shifts, while maintaining low parameter overhead and inference latency.
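The MoE-LoRA idea described above (several low-rank experts of different ranks, mixed by a router) can be sketched in PyTorch. This is an illustrative reading of the summary, not the authors' implementation; the class names, rank choices, and token-wise softmax routing are all assumptions.

```python
# Hypothetical sketch of a MoE-LoRA layer: LoRA experts with varying
# ranks plus a learned router that mixes their outputs per token.
import torch
import torch.nn as nn


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)) with a given rank."""

    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # A: dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)    # B: rank -> dim
        nn.init.zeros_(self.up.weight)                # zero update at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MoELoRA(nn.Module):
    """Router-weighted mixture of LoRA experts with different ranks."""

    def __init__(self, dim: int, ranks=(4, 8, 16)):
        super().__init__()
        self.experts = nn.ModuleList(LoRAExpert(dim, r) for r in ranks)
        self.router = nn.Linear(dim, len(ranks))       # token-wise gating

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)              # (..., E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (..., D, E)
        delta = (outs * gates.unsqueeze(-2)).sum(dim=-1)
        return x + delta                               # residual LoRA update


x = torch.randn(2, 16, 64)        # (batch, tokens, feature dim)
y = MoELoRA(dim=64)(x)
assert y.shape == x.shape
```

Because the up-projections are zero-initialized, the module starts as an identity update on the frozen backbone features, which is the standard LoRA convention.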

📝 Abstract
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively while fully realizing their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.
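The MoE-Adapter with adaptive kernel sizes can likewise be sketched as a bank of parallel depthwise convolutions (one per kernel size, injecting convolutional inductive bias into frozen ViT features) mixed by a global router. The kernel sizes and pooling-based routing here are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of an MoE-Adapter: depthwise conv experts with different
# kernel sizes, combined by a router conditioned on global context.
import torch
import torch.nn as nn


class MoEAdapter(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                   # global context
            nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(x), dim=-1)              # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B,C,H,W,E)
        mixed = (outs * gates.view(x.size(0), 1, 1, 1, -1)).sum(-1)
        return x + mixed                               # residual adapter


feat = torch.randn(2, 32, 16, 16)  # (batch, channels, height, width)
out = MoEAdapter(32)(feat)
assert out.shape == feat.shape
```

Depthwise (grouped) convolutions keep the adapter lightweight, which matches the paper's stated goal of low parameter overhead.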
Problem

Research questions and friction points this paper is trying to address.

Enhance robustness in stereo matching across diverse domains
Integrate Vision Foundation Models cost-effectively for better performance
Balance computational efficiency and accuracy in stereo matching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MoE-LoRA with adaptive ranks
Integrates MoE-Adapter with adaptive kernel sizes
Employs a lightweight decision network for selective expert activation
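The selective-activation idea from the list above can be sketched as a tiny gating network that scores each input and decides whether the MoE branch runs at all. The soft mixture during training and hard skip at inference are one plausible reading of "selectively activates experts based on input complexity"; the mechanism, names, and two-way logits are assumptions.

```python
# Illustrative decision network: score input complexity, then either run
# the (more expensive) MoE branch or skip it at inference time.
import torch
import torch.nn as nn


class DecisionGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 2),            # logits: [skip, use]
        )

    def forward(self, x: torch.Tensor, moe_branch: nn.Module) -> torch.Tensor:
        logits = self.score(x)
        if self.training:
            # soft, differentiable mixture during training
            p_use = torch.softmax(logits, dim=-1)[:, 1].view(-1, 1, 1, 1)
            return x + p_use * moe_branch(x)
        # hard per-sample decision at inference: branch runs only if selected
        use = logits.argmax(dim=-1).bool()
        out = x.clone()
        if use.any():
            out[use] = x[use] + moe_branch(x[use])
        return out


gate = DecisionGate(16).eval()
branch = nn.Conv2d(16, 16, 3, padding=1)   # stand-in for an MoE module
y = gate(torch.randn(2, 16, 8, 8), branch)
assert y.shape == (2, 16, 8, 8)
```

Skipping the branch entirely for "easy" inputs is what yields the inference-latency savings the summary claims, at the cost of a per-sample routing decision.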
👥 Authors
Yun Wang, City University of Hong Kong
Longguang Wang, NUDT (low-level vision, 3D vision, deep learning)
Chenghao Zhang, Renmin University of China (Natural Language Processing, Information Retrieval, Multimodal)
Yongjian Zhang, Shenzhen Campus, Sun Yat-sen University
Zhanjie Zhang, Zhejiang University (computer vision)
Ao Ma, JD.com (Generative AI, Video Generation)
Chenyou Fan, South China Normal University
Tin Lun Lam, The Chinese University of Hong Kong, Shenzhen
Junjie Hu, The Chinese University of Hong Kong, Shenzhen