🤖 AI Summary
In dense image matching, the single-correspondence assumption commonly fails under challenging scenarios such as depth discontinuities and large scale variations. To address this, this paper proposes the first multi-hypothesis modeling framework tailored for coarse-to-fine matching. The method generates multiple correspondence hypotheses per source pixel at each scale, employs a beam-search mechanism to propagate and prune high-confidence hypotheses across scales, and introduces a cross-attention module that fuses the hypotheses, enabling information exchange between them. Built on a Transformer-based multi-scale architecture, the approach significantly improves matching robustness and accuracy in complex scenes. Extensive experiments show state-of-the-art performance on standard benchmarks, with the largest gains under severe depth discontinuities and strong scale changes.
📝 Abstract
Dense image matching aims to find a correspondent for every pixel of a source image in a partially overlapping target image. State-of-the-art methods typically rely on a coarse-to-fine mechanism where a single correspondent hypothesis is produced per source location at each scale. In challenging cases -- such as at depth discontinuities or when the target image is a strong zoom-in of the source image -- the correspondents of neighboring source locations are often widely spread, and predicting a single correspondent hypothesis per source location at each scale may lead to erroneous matches. In this paper, we investigate the idea of predicting multiple correspondent hypotheses per source location at each scale instead. We consider a beam search strategy to propagate multiple hypotheses across scales and propose integrating these multiple hypotheses into cross-attention layers, resulting in a novel dense matching architecture called BEAMER. BEAMER learns to preserve and propagate multiple hypotheses across scales, making it significantly more robust than state-of-the-art methods, especially at depth discontinuities or when the target image is a strong zoom-in of the source image.
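The beam-search idea described above can be illustrated with a toy sketch: for one source pixel, keep the top-k correspondent hypotheses at the coarsest scale, expand each into finer-scale candidates, and prune back to the k most confident ones at every scale. Everything below (function names, the 2x upsampling scheme, the distance-based scorer) is our own illustrative assumption, not BEAMER's actual learned scoring or architecture.

```python
def beam_search_matches(init_hyps, refine, score, beam_width=2, num_scales=3):
    """Toy beam search over correspondence hypotheses for one source pixel.

    init_hyps  : candidate target positions at the coarsest scale (scale 0).
    refine(h)  : proposes finer-scale hypotheses derived from hypothesis h.
    score(h, s): matching confidence of hypothesis h at scale s (higher wins).
    """
    beam = sorted(init_hyps, key=lambda h: score(h, 0), reverse=True)[:beam_width]
    for s in range(1, num_scales):
        # Expand every surviving hypothesis, then prune to the beam width.
        candidates = [c for h in beam for c in refine(h)]
        beam = sorted(candidates, key=lambda h: score(h, s), reverse=True)[:beam_width]
    return beam

# Illustrative setup (a stand-in for a learned confidence, not the paper's):
TARGET = (13, 7)  # ground-truth correspondent at the finest scale

def score(h, s):
    # Compare against the target downsampled to scale s (3 scales, factor 2).
    tx, ty = TARGET[0] >> (2 - s), TARGET[1] >> (2 - s)
    return -abs(h[0] - tx) - abs(h[1] - ty)

def refine(h):
    # Upsample a hypothesis by 2x and propose the four child positions.
    x, y = 2 * h[0], 2 * h[1]
    return [(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)]

best = beam_search_matches([(3, 1), (0, 0)], refine, score)[0]
print(best)  # the finest-scale hypothesis closest to TARGET
```

Keeping several hypotheses alive is what distinguishes this from the usual coarse-to-fine scheme: a single-hypothesis tracker that committed to the wrong coarse match could never recover, whereas the beam can retain a lower-ranked coarse hypothesis whose children later score best.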