Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current multimodal large language models (MLLMs) lack systematic evaluation and training frameworks for the capabilities that wide-baseline matching demands: geometric understanding, reasoning about viewpoint change, fine-grained correspondence, and occlusion reasoning. This work introduces ReasonMatch-Bench, a benchmark for these capabilities, and proposes Dynamic Correspondence Reinforcement Learning (DCRL), trained on verifiable wide-baseline image pairs mined automatically from video and 3D data. DCRL combines an image-level viewpoint progression with a point-level correspondence curriculum and improves spatial reasoning through verifiable rewards, without explicit chain-of-thought supervision. On a difficult 90-sample subset of ReasonMatch-Bench, human annotators reach 84.0 F1 while the best existing baseline reaches 37.2. Experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial reasoning benchmarks.

📝 Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
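
As an illustration of what a verifiable, F1-style score for point correspondences could look like, the sketch below counts a predicted match as correct when it falls within a pixel tolerance of the ground-truth location. The function name, the dictionary input format, the 8-pixel tolerance, and the handling of empty inputs are all assumptions made for this example; the abstract does not specify how the benchmark or the DCRL reward computes F1.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def correspondence_f1(pred: Dict[int, Point],
                      gt: Dict[int, Point],
                      tol_px: float = 8.0) -> float:
    """F1 over point matches (hypothetical scoring rule).

    pred: query-point id -> predicted (x, y) in the target image.
          The model omits an id to declare the point not visible.
    gt:   query-point id -> true (x, y) for points visible in the target image.
    """
    if not pred and not gt:
        # Assumption: nothing to match and nothing predicted counts as perfect.
        return 1.0

    # A prediction is a true positive if the point is truly visible
    # and the predicted location is within tol_px of the ground truth.
    true_pos = sum(
        1 for pid, xy in pred.items()
        if pid in gt and math.dist(xy, gt[pid]) <= tol_px
    )
    if true_pos == 0:
        return 0.0

    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = {0: (10.0, 10.0), 1: (50.0, 50.0), 2: (90.0, 20.0)}
    pred = {0: (12.0, 11.0), 1: (80.0, 80.0), 3: (5.0, 5.0)}
    print(correspondence_f1(pred, gt))
```

Because the score depends only on predicted coordinates and geometry-derived ground truth, it can be checked automatically, which is the property that makes a signal like this usable as a reinforcement learning reward without chain-of-thought labels.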

Problem

Research questions and friction points this paper is trying to address.

Wide-Baseline Matching

Spatial Reasoning

Multimodal Large Language Models

Fine-grained Correspondence

Viewpoint Displacement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Wide-Baseline Matching

ReasonMatch-Bench

Dynamic Correspondence Reinforcement Learning