AI Summary
To address the weak generalization and the misalignment between pretraining and inference in two-stage urban profiling paradigms (representation learning followed by linear probing), this paper proposes Urban In-Context Learning, a unified single-stage framework that handles pretraining and inference in the same form. Its core innovation is the introduction of masked diffusion modeling into urban representation learning, realized via the Urban Masked Diffusion Transformer, so that each region's prediction is a distribution rather than a deterministic value. A representation alignment mechanism regularizes the model's intermediate features against those of classical urban profiling methods, stabilizing diffusion training and injecting classical urban analytical priors into end-to-end contextual learning over non-linguistic, structured urban data. The method consistently outperforms state-of-the-art two-stage approaches on three profiling indicators across two cities, and ablation studies confirm that the masked diffusion mechanism is particularly important for distributional prediction.
Abstract
Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. Existing approaches typically follow a two-stage paradigm: first, learning representations of urban areas; second, performing downstream prediction via linear probing, a practice that originates from the BERT era. Inspired by the development of GPT-style models, recent studies have shown that novel self-supervised pretraining schemes can endow models with direct applicability to downstream tasks, thereby eliminating the need for task-specific fine-tuning. This is largely because GPT unifies the form of pretraining and inference through next-token prediction. However, urban data exhibit structural characteristics that differ fundamentally from language, making it challenging to design a one-stage model that unifies both pretraining and inference. In this work, we propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. To capture the distribution of urban profiles, we introduce the Urban Masked Diffusion Transformer, which enables each region's prediction to be represented as a distribution rather than a deterministic value. Furthermore, to stabilize diffusion training, we propose the Urban Representation Alignment Mechanism, which regularizes the model's intermediate features by aligning them with those from classical urban profiling methods. Extensive experiments on three indicators across two cities demonstrate that our one-stage method consistently outperforms state-of-the-art two-stage approaches. Ablation studies and case studies further validate the effectiveness of each proposed module, particularly the use of diffusion modeling.
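To make the combination of masked autoencoding, diffusion, and representation alignment concrete, the toy objective below sketches the idea in NumPy. All names, shapes, masking choices, and the identity-map "denoiser" are illustrative assumptions, not the paper's architecture (which uses a transformer); the sketch only shows how a masked-region diffusion loss and a feature-alignment regularizer could be combined into one training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: n_regions urban regions, each with a d-dim profile vector.
n_regions, d = 16, 8
profiles = rng.normal(size=(n_regions, d))    # ground-truth urban profiles
reference = rng.normal(size=(n_regions, d))   # features from a classical method (assumed given)


def masked_diffusion_loss(profiles, reference, alpha_bar=0.7, lam=0.1):
    """Toy objective: denoise masked regions + align intermediate features."""
    n, dim = profiles.shape
    mask = np.arange(n) % 2 == 0              # hide every other region's profile
    eps = rng.normal(size=(n, dim))
    # Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    x_t = np.sqrt(alpha_bar) * profiles + np.sqrt(1.0 - alpha_bar) * eps
    # Stand-in "denoiser": a fixed linear map (a real model would be a transformer
    # attending from masked regions to the visible, in-context regions).
    W = np.eye(dim)
    eps_hat = x_t @ W                          # predicted noise
    features = x_t @ W.T                       # stand-in intermediate features
    # Diffusion loss only on masked regions; visible regions serve as context.
    diff_loss = np.mean((eps_hat[mask] - eps[mask]) ** 2)
    # Alignment regularizer: pull features toward the classical reference (cosine).
    cos = np.sum(features * reference, axis=1) / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(reference, axis=1) + 1e-8
    )
    align_loss = np.mean(1.0 - cos)
    return diff_loss + lam * align_loss


loss = masked_diffusion_loss(profiles, reference)
```

Because the denoiser predicts noise for masked regions conditioned on visible ones, pretraining and inference share the same form: at inference time the unknown regions are simply the masked ones, and sampling the reverse diffusion yields a distribution over their profiles rather than a point estimate.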