Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in urban image region representation learning: (1) fine-grained alignment between visual features and lengthy textual descriptions, and (2) annotation noise introduced by large language model (LLM)-generated captions. To tackle these, we propose a long-text-aware and noise-robust cross-modal learning framework. Methodologically, we first introduce an information-preserving stretch interpolation strategy to enable token-level fine-grained alignment between visual features and long text sequences. Second, we design a two-stage optimization framework integrating multi-model collaborative caption generation, momentum-based self-distillation for pseudo-label learning, and contrastive loss to enhance noise resilience. Extensive experiments on four real-world urban datasets demonstrate that our approach significantly outperforms state-of-the-art methods on downstream tasks—including semantic segmentation and region retrieval—while exhibiting strong cross-regional generalization and practical applicability.
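The "information-preserving stretch interpolation" described above plausibly resembles the common trick of interpolating a pretrained text encoder's positional-embedding table so it can accept long captions without truncation. The paper does not publish this code, so the following is a minimal sketch under that assumption; the function name, the 77-token CLIP context length, and the 248-token target are illustrative only.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Linearly interpolate a (seq_len, dim) positional-embedding table
    to new_len positions, so a CLIP-style text encoder can read captions
    longer than its pretrained context window without discarding the
    positional information it already learned."""
    # (seq_len, dim) -> (1, dim, seq_len): 1-D interpolation runs over positions
    x = pos_emb.T.unsqueeze(0)
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=True)
    return x.squeeze(0).T  # back to (new_len, dim)

# e.g. stretch a 77-token CLIP table to 248 tokens (both lengths illustrative)
old = torch.randn(77, 512)
new = stretch_positional_embeddings(old, 248)
```

With `align_corners=True` the first and last positional embeddings are preserved exactly, and interior positions are blended smoothly, which is one reasonable reading of "information-preserving".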

📝 Abstract
Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.
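The momentum-based self-distillation mentioned in the abstract is typically implemented with an exponential-moving-average (EMA) teacher encoder whose slowly changing weights produce stable pseudo-targets despite noisy captions. The sketch below shows only the EMA update rule under that assumption; the function name `ema_update`, the momentum value 0.995, and the toy `Linear` stand-in are illustrative, not taken from the paper.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.995) -> None:
    """Update the momentum (teacher) encoder as an exponential moving
    average of the student's weights; the slowly drifting teacher then
    supplies stable soft pseudo-targets for cross-modal learning."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data.mul_(m).add_(ps.data, alpha=1.0 - m)

# Tiny stand-in for an encoder, just to demonstrate the update rule.
student = torch.nn.Linear(8, 4)
teacher = copy.deepcopy(student)   # teacher starts as an exact copy

with torch.no_grad():
    student.weight.add_(1.0)       # pretend one noisy training step happened
ema_update(teacher, student)       # teacher moves only (1 - m) of the way
```

Because the teacher absorbs only a 0.5% fraction of each student update, a few mislabeled captions perturb the pseudo-targets far less than they perturb the student itself.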
Problem

Research questions and friction points this paper is trying to address.

Aligning fine-grained urban visual features with long captions
Reducing noise in LLM-generated captions for urban imagery
Improving cross-modal learning for urban region representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-text awareness with fine-grained visual alignment
Dual-level optimization for noise suppression
Multi-model collaboration for reliable caption generation
Yimei Zhang
College of Computer Science and Technology, Zhejiang University of Technology
Guojiang Shen
College of Computer Science and Technology, Zhejiang University of Technology
Kaili Ning
College of Computer Science and Technology, Zhejiang University of Technology
Tongwei Ren
Nanjing University
Xuebo Qiu
College of Computer Science and Technology, Zhejiang University of Technology
Mengmeng Wang
College of Computer Science and Technology, Zhejiang University of Technology
Xiangjie Kong
College of Computer Science and Technology, Zhejiang University of Technology