🤖 AI Summary
This work addresses the absence of multi-scale modeling in image-based scene flow estimation. Starting from a single-scale recurrent scene flow backbone, we generalize hierarchical ideas that have proven successful for optical flow and stereo to image-based scene flow. Our coarse-to-fine approach revisits three components: (1) the feature and context encoders; (2) the overall coarse-to-fine framework; and (3) the training loss. On the KITTI and Spring benchmarks, it outperforms the previous state of the art by 8.7% (3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively, where the lower value is the better one. The source code is publicly available.
📝 Abstract
Although multi-scale concepts have recently proven useful for recurrent network architectures in optical flow and stereo, they have not yet been considered for image-based scene flow. Hence, based on a single-scale recurrent scene flow backbone, we develop a multi-scale approach that generalizes successful hierarchical ideas from optical flow to image-based scene flow. By considering suitable concepts for the feature and context encoders, the overall coarse-to-fine framework, and the training loss, we succeed in designing a scene flow approach that outperforms the current state of the art on KITTI and Spring by 8.7% (3.89 vs. 4.26) and 65.8% (9.13 vs. 26.71), respectively. Our code is available at https://github.com/cv-stuttgart/MS-RAFT-3D.
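
The reported percentages appear to be relative reductions with respect to the previous best value, which the following minimal sketch reproduces from the numbers quoted above. The function name `relative_reduction` is a hypothetical helper introduced only for illustration, and the reading of both metrics as lower-is-better is an assumption inferred from the comparison itself.

```python
def relative_reduction(previous: float, ours: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent.

    Hypothetical helper; assumes the percentage is measured
    against the previous best value.
    """
    return 100.0 * (previous - ours) / previous


# Values quoted in the abstract: (previous state of the art, proposed method)
kitti = relative_reduction(4.26, 3.89)
spring = relative_reduction(26.71, 9.13)

print(f"KITTI:  {kitti:.1f}%")   # 8.7%
print(f"Spring: {spring:.1f}%")  # 65.8%
```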