EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

📅 2026-03-06
🤖 AI Summary
This work addresses cross-modal translation among electro-optical (EO), infrared (IR), and synthetic aperture radar (SAR) aerial imagery, a task made difficult by the modalities' distinct electromagnetic signatures and geometric characteristics. The authors propose EarthBridge, a framework that explores two approaches: Diffusion Bridge Implicit Models (DBIM) and Contrastive Unpaired Translation (CUT). The DBIM variant uses a non-Markovian bridge process for high-quality deterministic sampling, and pairs a channel-concatenated UNet denoiser with a tailored “booting noise” initialization to mitigate ambiguity in cross-modal mapping. Evaluated on all four tasks of the MAVIC-T challenge, EarthBridge achieves a composite score of 0.38, placing second on the leaderboard, and the authors report improved spatial detail and spectral accuracy in the generated images.

πŸ“ Abstract
Cross-modal image-to-image translation among Electro-Optical (EO), Infrared (IR), and Synthetic Aperture Radar (SAR) sensors is essential for comprehensive multi-modal aerial-view analysis. However, translating between these modalities is notoriously difficult due to their distinct electromagnetic signatures and geometric characteristics. This paper presents EarthBridge, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge – Translation (MAVIC-T). We explore two distinct methodologies: Diffusion Bridge Implicit Models (DBIM), which we generalize using non-Markovian bridge processes for high-quality deterministic sampling, and Contrastive Unpaired Translation (CUT), which utilizes contrastive learning for structural consistency. Our EarthBridge framework employs a channel-concatenated UNet denoiser trained with Karras-weighted bridge scalings and a specialized "booting noise" initialization to handle the inherent ambiguity in cross-modal mappings. We evaluate these methods across all four challenge tasks (SAR→EO, SAR→RGB, SAR→IR, RGB→IR), achieving superior spatial detail and spectral accuracy. Our solution achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard. Code is available at https://github.com/Bili-Sakura/EarthBridge-Preview.
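
The abstract says CUT relies on contrastive learning for structural consistency. As a rough illustration of that idea only, the sketch below computes an InfoNCE-style loss for a single patch feature: the query (a feature from the translated image) should match the positive (the source feature at the same location) better than the negatives (source features from other locations). The function name `info_nce`, the toy 2-D feature vectors, and the temperature of 0.07 are assumptions made for illustration; they are not taken from the EarthBridge code.

```python
import math


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query feature (hypothetical helper, illustration only).

    query:     feature vector from a patch of the translated image
    positive:  feature vector from the same patch location in the source image
    negatives: feature vectors from other patch locations in the source image
    """
    def normalize(v):
        # L2-normalize so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in v))
        norm = max(norm, 1e-12)  # guard against a zero vector
        return [x / norm for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]

    # Cross-entropy with the positive at index 0, via a stabilized log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy usage: the loss is small when the query matches its positive,
# and large when it matches one of the negatives instead.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In a full pipeline this per-patch loss would presumably be averaged over many patch locations and feature layers; the sketch leaves that out to keep the core comparison visible.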
Problem

Research questions and friction points this paper is trying to address.

cross-modal image translation
multi-modal aerial imagery
EO-IR-SAR translation
image-to-image translation
aerial view analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Bridge Implicit Models
Contrastive Unpaired Translation
Non-Markovian Bridge Processes
Cross-modal Image Translation
Bootstrapped Noise Initialization