🤖 AI Summary
In high-dynamic scenes, large inter-frame displacements cause spatial blur and temporal motion discontinuity; existing frame-event fusion methods suffer from misaligned features and degraded optical flow accuracy due to modality heterogeneity. To address this, we propose a shared latent space modeling framework that (i) introduces, for the first time, a visual boundary localization mechanism grounded in joint spatiotemporal gradients, and (ii) designs a boundary-guided fusion paradigm that complements the sparse event-based and dense frame-based motion correlations. Our approach achieves explicit spatiotemporal gradient alignment and semantic complementarity between frames and events, ensuring both interpretability and dense, temporally consistent optical flow estimation. Evaluated on multiple high-dynamic benchmarks, our method significantly outperforms state-of-the-art approaches, effectively mitigating modality heterogeneity bias while improving optical flow accuracy and temporal coherence.
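The shared-gradient idea behind the boundary localization step can be sketched numerically: both modalities respond to the same visual boundaries, so the per-pixel agreement of their spatiotemporal gradient magnitudes highlights those boundaries. The sketch below is illustrative only, with hypothetical function names and toy data; the paper's learned formulation in a common latent space is more involved.

```python
import numpy as np

def spatiotemporal_gradients(volume):
    """Gradient magnitude over (t, y, x) for a stacked volume [T, H, W]."""
    gt, gy, gx = np.gradient(volume.astype(np.float64))
    return np.sqrt(gt**2 + gy**2 + gx**2)

def boundary_map(frame_volume, event_volume, eps=1e-8):
    """Per-pixel cosine similarity (over time) of the two modalities'
    spatiotemporal gradient magnitudes; high values suggest a boundary
    that both modalities agree on."""
    gf = spatiotemporal_gradients(frame_volume)   # [T, H, W]
    ge = spatiotemporal_gradients(event_volume)   # [T, H, W]
    num = (gf * ge).sum(axis=0)
    den = np.linalg.norm(gf, axis=0) * np.linalg.norm(ge, axis=0) + eps
    return num / den                              # [H, W], in [0, 1]

# Toy data: a vertical edge moving right, seen by both modalities.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W)); events = np.zeros((T, H, W))
for t in range(T):
    frames[t, :, :2 + t] = 1.0          # intensity step edge
    events[t, :, 1 + t:3 + t] = 1.0     # events fire near the moving edge

b = boundary_map(frames, events)
```

In this toy setup the similarity is high only where both modalities register the moving edge and near zero in the static background, which is the behavior a gradient-based reference boundary would exploit.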
📝 Abstract
High-dynamic scene optical flow is a challenging task that suffers from spatial blur and temporally discontinuous motion caused by large displacements in frame imaging, which deteriorate the spatiotemporal features of optical flow. Existing methods typically introduce an event camera and directly fuse the spatiotemporal features of the two modalities. However, this direct fusion is ineffective, since the heterogeneous data representations of the frame and event modalities leave a large gap between them. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, comprising visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we observe that frames and events share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design a common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that frame-based motion possesses spatially dense but temporally discontinuous correlation, while event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide complementary motion knowledge fusion between the two modalities. Moreover, the common spatiotemporal fusion not only relieves the cross-modal feature discrepancy, but also makes the fusion process interpretable for dense and continuous optical flow estimation. Extensive experiments verify the superiority of the proposed method.
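The complementary fusion described above can be illustrated with a minimal weighting scheme: the dense frame-based flow is kept everywhere, while the sparse but temporally continuous event-based flow is trusted more near the reference boundary, where events actually fire. All names, the blending rule, and the toy data below are hypothetical simplifications of the paper's boundary-guided fusion, not its actual implementation.

```python
import numpy as np

def fuse_motion(frame_flow, event_flow, event_valid, boundary, alpha=0.5):
    """Boundary-guided fusion (illustrative sketch).
    frame_flow, event_flow: [H, W, 2] flow fields.
    event_valid: [H, W] mask, 1 where events fired (sparse support).
    boundary:    [H, W] reference boundary confidence in [0, 1].
    Event flow only contributes where it is valid, weighted up near
    boundaries; elsewhere the dense frame flow is kept as-is."""
    w = alpha * boundary * event_valid   # [H, W], 0 off the event support
    w = w[..., None]                     # broadcast over the 2 flow channels
    return w * event_flow + (1.0 - w) * frame_flow

H, W = 6, 6
frame_flow = np.full((H, W, 2), 2.0)     # dense but blur-degraded estimate
event_flow = np.full((H, W, 2), 1.0)     # sharper estimate near boundaries
event_valid = np.zeros((H, W)); event_valid[:, 3] = 1.0   # sparse column
boundary = np.zeros((H, W)); boundary[:, 3] = 1.0         # boundary there

fused = fuse_motion(frame_flow, event_flow, event_valid, boundary)
```

Away from events the output equals the frame flow, and on the boundary column the two estimates are blended, mirroring the dense-sparse complementarity the abstract describes.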