AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for cross-view geometric reconstruction and novel-view synthesis between aerial and ground-level images perform poorly, primarily due to the scarcity of high-quality, co-registered cross-domain training data. To address this, we propose a hybrid aerial-ground data construction paradigm: controllable pseudo-aerial views are rendered from city-scale 3D meshes (e.g., Google Earth) and tightly co-registered with real crowdsourced street-view imagery (e.g., MegaDepth), preserving fine-grained ground-level detail and bridging the sim-to-real domain gap. Fine-tuning state-of-the-art models such as DUSt3R on this hybrid dataset yields large gains: in zero-shot camera pose estimation, the fraction of aerial-ground pairs with rotation error below 5° rises from under 5% to nearly 56%, and novel-view synthesis quality and reconstruction robustness improve accordingly.
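
For context, the 5° figure is the standard geodesic rotation error between estimated and ground-truth camera rotations. Below is a minimal numpy sketch of the metric, assuming 3x3 rotation matrices; the function names and list-based interface are illustrative, not taken from the paper's code:

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # For rotation matrices, trace(R_est @ R_gt.T) = 1 + 2*cos(theta).
    cos_t = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def success_rate(R_ests, R_gts, thresh_deg=5.0):
    """Fraction of (estimated, ground-truth) pairs under the threshold."""
    errs = [rotation_error_deg(Re, Rg) for Re, Rg in zip(R_ests, R_gts)]
    return float(np.mean([e < thresh_deg for e in errs]))
```

For an aerial-ground pair, the comparison is typically made on the relative rotation between the two cameras, i.e. R_rel = R_aerial @ R_ground.T, computed for both the estimate and the ground truth.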

📝 Abstract
We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
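
A key ingredient here is co-registration: each pseudo-aerial rendering comes with exact depth and pose, so aerial-ground training pairs can be mined by checking depth-consistent reprojection, as in MegaDepth-style overlap estimation. The following is a hypothetical numpy sketch under those assumptions; the function name, the `tol` parameter, and the 4x4 world-to-camera convention are illustrative choices, not the paper's actual pipeline:

```python
import numpy as np

def covisibility(depth1, K1, T1, depth2, K2, T2, tol=0.05):
    """Fraction of image-1 pixels co-visible in image 2 via
    depth-consistent reprojection.

    depth*: (H, W) metric depth maps; K*: 3x3 intrinsics;
    T*: 4x4 world-to-camera extrinsics.
    """
    h, w = depth1.shape
    v, u = np.mgrid[0:h, 0:w]
    z1 = depth1.ravel()
    # Unproject image-1 pixels to camera-1 points scaled by depth.
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    pts_c1 = np.linalg.inv(K1) @ (pix * z1)
    # Camera 1 -> world -> camera 2.
    pts_w = np.linalg.inv(T1) @ np.vstack([pts_c1, np.ones(h * w)])
    pts_c2 = (T2 @ pts_w)[:3]
    z2 = pts_c2[2]
    z2_safe = np.where(np.abs(z2) < 1e-9, 1e-9, z2)  # avoid div by zero
    proj = K2 @ pts_c2
    u2 = np.round(proj[0] / z2_safe).astype(np.int64)
    v2 = np.round(proj[1] / z2_safe).astype(np.int64)
    # Keep points with valid depth that land in front of camera 2
    # and inside its image bounds.
    ok = (z1 > 0) & (z2 > 0) & (u2 >= 0) & (u2 < depth2.shape[1]) \
         & (v2 >= 0) & (v2 < depth2.shape[0])
    covis = np.zeros(h * w, dtype=bool)
    d2 = depth2[v2[ok], u2[ok]]
    # Depth consistency: reprojected depth matches camera-2's depth map.
    covis[ok] = np.abs(d2 - z2[ok]) < tol * d2
    return covis.mean()
```

A typical use is to run this in both directions (1 to 2 and 2 to 1) and keep pairs whose mutual overlap falls in a moderate range, so that training pairs are challenging but still solvable.
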
Problem

Research questions and friction points this paper is trying to address.

Handling extreme viewpoint variation in aerial-ground image pairs
Lack of high-quality co-registered aerial-ground datasets for training
Improving geometric reconstruction and view synthesis in mixed aerial-ground scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines pseudo-synthetic aerial renderings with real crowd-sourced ground images
Renders controllable aerial viewpoints from city-wide 3D meshes (e.g., Google Earth)
Raises zero-shot aerial-ground pose accuracy (rotation error < 5°) from under 5% to nearly 56%