🤖 AI Summary
High-resolution ground subsidence exhibits strong nonlinearity and long-range spatiotemporal dependencies, which existing unimodal models (e.g., ConvLSTM) struggle to capture. To address this, we propose a multimodal spatiotemporal Transformer framework that, for the first time, jointly encodes dynamic InSAR displacement sequences with static physical priors (including geological and hydrological parameters) through a unified spatiotemporal attention mechanism. This design overcomes the inherent limitations of the unimodal paradigm and significantly improves the modeling of long-term temporal dependencies. Evaluated on the EGMS benchmark dataset, our method reduces RMSE for long-horizon forecasting by an order of magnitude compared to state-of-the-art approaches such as STGCN and STAEformer, establishing a new state of the art on this dataset.
📝 Abstract
Forecasting high-resolution land subsidence is a critical yet challenging task due to its complex, non-linear dynamics. While standard architectures like ConvLSTM often fail to model long-range dependencies, we argue that a more fundamental limitation of prior work lies in the uni-modal data paradigm. To address this, we propose the Multi-Modal Spatio-Temporal Transformer (MM-STT), a novel framework that fuses dynamic displacement data with static physical priors. Its core innovation is a joint spatio-temporal attention mechanism that processes all multi-modal features in a unified manner. On the public EGMS dataset, MM-STT establishes a new state of the art, reducing the long-range forecast RMSE by an order of magnitude compared to all baselines, including state-of-the-art methods such as STGCN and STAEformer. Our results demonstrate that for this class of problems, an architecture's inherent capacity for deep multi-modal fusion is paramount for achieving transformative performance.
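To make the idea of joint spatio-temporal attention over fused features concrete, the following is a minimal, dependency-free sketch and not the MM-STT implementation. It assumes a displacement series of shape (T, N), one vector of S static priors per measurement point, a single attention head, and random projection weights. All names (`build_tokens`, `joint_attention`, `rand_matrix`) and sizes are hypothetical choices for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_tokens(disp, static):
    """Fuse modalities: one token per (time step, point).

    disp:   T rows of N displacement values (dynamic modality)
    static: N rows of S prior values (static modality, repeated over time)
    Returns T*N tokens of length 1 + S, ordered as index = t * N + n.
    """
    return [[disp[t][n]] + list(static[n])
            for t in range(len(disp)) for n in range(len(static))]


def joint_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over all tokens at once, so every
    (time, point) token can attend across both space and time."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 time steps, 5 points, 3 static priors, model width 8.
T, N, S, D = 4, 5, 3, 8
rng = random.Random(0)
disp = rand_matrix(rng, T, N)
static = rand_matrix(rng, N, S)
tokens = build_tokens(disp, static)
w_q, w_k, w_v = (rand_matrix(rng, 1 + S, D) for _ in range(3))
out, weights = joint_attention(tokens, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 20 8
```

Because attention here runs over all T*N tokens jointly, its cost grows quadratically with the number of time steps times the number of points. A practical model for high-resolution data would need patching, learned embeddings, positional encodings, and multiple heads and layers, none of which are specified in the abstract.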