🤖 AI Summary
This paper addresses the joint temporal (hour/month) and geographic (GPS) localization of outdoor images. We propose a decoupled yet collaborative multimodal metric-learning framework. Our key contributions are: (1) a toroidal periodic time-metric loss that models temporal cyclicity via soft labels, mitigating reliance on hard negatives; (2) independent encoders projecting visual, temporal, and positional features into a unified high-dimensional embedding space, enabling both cross-modal alignment and disentangled optimization; and (3) end-to-end joint prediction that requires no ground-truth GPS coordinates at inference. On new benchmarks, our method surpasses previous time-prediction approaches, including those given the ground-truth geo-location as an input at inference, and achieves competitive results on standard geo-localization tasks. Moreover, the unified embedding space supports compositional and text-driven cross-modal image retrieval.
📝 Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.
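The temporal objective can be pictured as follows: each timestamp (hour, month) is a point on a 2-D torus, since both axes wrap around, and pairwise toroidal distances are converted into soft targets rather than hard positive/negative labels. Below is a minimal NumPy sketch of this idea; the axis normalization, the `exp(-d/tau)` soft-label form, and the temperature values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def toroidal_time_distance(t1, t2):
    """Distance between timestamps t = (hour, month) on a torus:
    both axes are cyclic (hours mod 24, months mod 12), so e.g.
    23:00 and 01:00 are close. Normalization to [0, 1] per axis
    is an illustrative choice."""
    dh = np.abs(t1[..., 0] - t2[..., 0])
    dh = np.minimum(dh, 24.0 - dh) / 12.0   # wrap-around hour difference
    dm = np.abs(t1[..., 1] - t2[..., 1])
    dm = np.minimum(dm, 12.0 - dm) / 6.0    # wrap-around month difference
    return np.sqrt(dh ** 2 + dm ** 2)

def soft_target_matrix(times, tau=0.5):
    """Soft labels: row-normalized similarities exp(-d / tau) over all
    pairwise toroidal distances in the batch (assumed label shape)."""
    d = toroidal_time_distance(times[:, None, :], times[None, :, :])
    sim = np.exp(-d / tau)
    return sim / sim.sum(axis=1, keepdims=True)

def soft_temporal_loss(img_emb, time_emb, times, temp=0.07, tau=0.5):
    """Cross-entropy between the softmaxed image-time embedding
    similarities and the soft toroidal targets, replacing the hard
    one-hot positives of conventional contrastive learning."""
    logits = img_emb @ time_emb.T / temp
    m = logits.max(axis=1, keepdims=True)   # stable log-softmax
    logp = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    targets = soft_target_matrix(times, tau)
    return -(targets * logp).sum(axis=1).mean()
```

With this objective, a batch item captured at 23:00 in December still provides a partially positive signal for an embedding near 01:00 in January, which is the intuition behind modeling pairwise time differences on a cyclical surface instead of relying on hard negatives.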