Temporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the semantic-alignment challenge in generating high-fidelity images from non-visual time-series data. The authors propose TimeArtist, a framework with two key innovations: (1) a two-stage "warm-up–alignment" training paradigm, which first performs self-supervised pretraining with dual autoencoders sharing a vector quantizer, then freezes the encoders and trains a representation-layer projection to achieve cross-modal semantic alignment between time series and vision; and (2) a unified architecture with transferable spatial priors that maps temporal dynamics onto controllable visual styles. Experiments demonstrate state-of-the-art performance both on image-quality metrics (e.g., FID, LPIPS) and on downstream zero-shot time-series forecasting tasks, validating the effectiveness of cross-modal semantic alignment and generative representation transfer.

📝 Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential data as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert time series into "pseudo-images" for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time-series fluctuations and visual concepts. It introduces a "warmup-align" paradigm: first, a dual autoencoder and a shared quantizer are trained in a self-supervised manner on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as a form of style transfer. Extensive experiments show that TimeArtist achieves strong performance on image generation metrics while also attaining superior results on zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.
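The core of the first ("warm-up") stage is a single vector quantizer shared by both modality encoders, so that time-series and image latents are discretized against the same codebook. The sketch below illustrates that idea only; the codebook, latent dimensions, and encoder outputs are all hypothetical stand-ins, not TimeArtist's actual components.

```python
import math

def quantize(vector, codebook):
    """Map a continuous latent vector to its nearest codebook entry.

    Because one codebook is shared by both the (hypothetical) time-series
    and image encoders, latents from either modality land in the same
    discrete token space -- the premise of the warm-up stage.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    idx = min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))
    return idx, codebook[idx]

# Hypothetical shared codebook with 4 entries in a 2-D latent space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

ts_latent = (0.9, 0.1)   # latent from a hypothetical time-series encoder
img_latent = (0.8, 0.2)  # latent from a hypothetical image encoder

ts_idx, _ = quantize(ts_latent, codebook)
img_idx, _ = quantize(img_latent, codebook)
print(ts_idx, img_idx)  # nearby latents from different modalities share a code
```

In the full method this shared discrete space is what makes the later alignment stage possible: a frozen quantizer gives both modalities a common vocabulary to project into.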
Problem

Research questions and friction points this paper is trying to address.

Transferring spatial priors from vision models to temporal tasks
Establishing semantic alignment between time series and visual concepts
Enabling high-fidelity image generation directly from temporal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic alignment between time series and visual concepts
Dual-autoencoder with shared quantizer for cross-modal learning
Frozen encoders with projection enable zero-shot temporal tasks
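The second ("align") stage trains only a projection between the frozen representation spaces. As a minimal illustration, the sketch below fits a 1-D linear projection from temporal latents to visual latents by closed-form least squares; the scalar latents, pairing, and loss are illustrative assumptions, since the paper's actual projection head and training objective are not detailed here.

```python
# Hypothetical frozen latents for paired time-series / image samples.
temporal_latents = [0.0, 1.0, 2.0, 3.0]
visual_latents = [0.1, 2.1, 4.1, 6.1]  # roughly 2*x + 0.1

def fit_projection(xs, ys):
    """Closed-form least squares for y ~ w*x + b.

    Stands in for the representation-layer projection trained in the
    align stage while both encoders stay frozen.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

w, b = fit_projection(temporal_latents, visual_latents)
aligned = [w * x + b for x in temporal_latents]
print(round(w, 3), round(b, 3))  # recovers the underlying mapping ~ (2.0, 0.1)
```

The design point this mirrors is that only the lightweight projection is learned in stage two, which is what lets the aligned representations transfer to zero-shot temporal tasks without retraining the encoders.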
Xiangkai Ma
Nanjing University
Time series · Multi-modality · Vision-language-action
Han Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Wenzhong Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Sanglu Lu
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China