CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing cross-view geo-localization methods perform either large-scale image retrieval or high-precision 3-DoF pose estimation, but not both, and cascading the two pipelines introduces error propagation and inconsistent feature representations. This work proposes CIPER, a unified framework that performs both tasks jointly in a single architecture. CIPER uses a shared transformer encoder with task-specific tokens to separate global retrieval features from spatial localization cues, and adds a two-way transformer pose decoder with bidirectional cross-attention, together with a set prediction strategy, to bridge the domain gap between ground and aerial views. Experiments on the VIGOR, KITTI, and Ford Multi-AV datasets show competitive performance, particularly under limited field of view and arbitrary orientation.

📝 Abstract

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
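To make the two sub-tasks in the abstract concrete, here is a minimal pure-Python sketch, not CIPER's implementation. It assumes retrieval ranks aerial descriptors by cosine similarity to a ground descriptor, and that a 3-DoF pose is a planar position plus heading, `(x, y, yaw)`, in metres and radians. The function names, descriptor values, and units are all hypothetical and chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(ground_descriptor, aerial_descriptors, k=1):
    """Return indices of the k aerial descriptors most similar to the ground one."""
    scored = [
        (cosine_similarity(ground_descriptor, d), i)
        for i, d in enumerate(aerial_descriptors)
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]


def pose_error(pred, target):
    """Translation and heading error between two (x, y, yaw) poses.

    Assumed convention: x and y in metres, yaw in radians.
    The yaw difference is wrapped into [-pi, pi) before taking its magnitude.
    """
    translation = math.hypot(pred[0] - target[0], pred[1] - target[1])
    dyaw = (pred[2] - target[2] + math.pi) % (2 * math.pi) - math.pi
    return translation, abs(dyaw)


# Hypothetical toy data: three aerial descriptors and one ground descriptor.
aerial_db = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ground = [0.9, 0.1]
top2 = retrieve_top_k(ground, aerial_db, k=2)

# Hypothetical predicted and reference poses inside the retrieved aerial tile.
t_err, yaw_err = pose_error((3.0, 4.0, 0.1), (0.0, 0.0, 2 * math.pi - 0.1))
```

In a cascaded pipeline these two steps would rely on separately trained features. The abstract's point is that CIPER produces both the retrieval descriptor and the pose from one shared encoder.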

Problem

Research questions and friction points this paper is trying to address.

cross-view geo-localization

image retrieval

pose estimation

3-DoF localization

domain gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-view geo-localization

unified framework

transformer-based pose estimation