🤖 AI Summary
Problem: Balancing privacy preservation and synthetic data utility remains challenging, particularly for sensitive time-series data (e.g., healthcare, finance). Method: This paper proposes Pub2Priv, a novel framework that leverages publicly available, non-sensitive contextual metadata (e.g., weather, electricity prices) to guide the generation of private time-series data. It introduces a self-attention mechanism to distill heterogeneous public knowledge into joint temporal and feature embeddings, which condition a diffusion-based generative model. Additionally, it proposes a new identifiability-based metric for rigorous privacy evaluation. Contribution/Results: Extensive experiments on multiple real-world datasets demonstrate that Pub2Priv significantly outperforms state-of-the-art methods. Crucially, it achieves high statistical fidelity and strong downstream task performance, even under stringent differential privacy guarantees (ε < 1.5). The framework establishes a scalable, verifiable paradigm for cross-domain secure data sharing.
📝 Abstract
Sharing sensitive time series data, such as patient records or investment accounts, is often restricted by privacy concerns in domains such as finance, healthcare, and energy consumption. Privacy-aware synthetic time series generation addresses this challenge by injecting noise during training, which inherently introduces a trade-off between privacy and utility. In many cases, sensitive sequences are correlated with publicly available, non-sensitive contextual metadata (e.g., household electricity consumption may be influenced by weather conditions and electricity prices). However, existing privacy-aware data generation methods often overlook this opportunity, resulting in suboptimal privacy-utility trade-offs. In this paper, we present Pub2Priv, a novel framework for generating private time series data by leveraging heterogeneous public knowledge. Our model employs a self-attention mechanism to encode public data into temporal and feature embeddings, which serve as conditional inputs for a diffusion model to generate synthetic private sequences. Additionally, we introduce a practical metric to assess privacy by evaluating the identifiability of the synthetic data. Experimental results show that Pub2Priv consistently outperforms state-of-the-art benchmarks in improving the privacy-utility trade-off across finance, energy, and commodity trading domains.
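The abstract does not define its identifiability metric, so the following is only a minimal sketch of one common nearest-neighbor formulation, not the paper's definition. It assumes a real sequence counts as "identifiable" when its closest synthetic sequence is nearer than its closest other real sequence. The function names `euclidean` and `identifiability`, the choice of Euclidean distance, and the toy data are all illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identifiability(real, synthetic):
    """Fraction of real sequences whose nearest synthetic sequence is closer
    than their nearest *other* real sequence.

    Assumed definition for illustration only; the paper's metric may differ.
    Lower values suggest the synthetic data exposes fewer real records.
    """
    if len(real) < 2 or not synthetic:
        raise ValueError("need at least two real and one synthetic sequence")
    flagged = 0
    for i, r in enumerate(real):
        # Distance to the closest real sequence other than r itself.
        d_real = min(euclidean(r, o) for j, o in enumerate(real) if j != i)
        # Distance to the closest synthetic sequence.
        d_syn = min(euclidean(r, s) for s in synthetic)
        if d_syn < d_real:
            flagged += 1
    return flagged / len(real)


# Toy data (hypothetical): three real sequences of length 3.
real = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
leaky = [[5.0, 5.0, 5.1]]      # near-copy of the third real sequence
distant = [[20.0, 20.0, 20.0]]  # far from every real sequence

print(identifiability(real, leaky))    # one of three real sequences is flagged
print(identifiability(real, distant))  # no real sequence is flagged
```

A score near zero alone says nothing about utility: the `distant` sample above is private under this measure but useless as synthetic data, which is why such a metric is read alongside fidelity and downstream-task measures.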