A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport

📅 2025-10-28
🤖 AI Summary
Smart-card transit data pose an inherent trade-off between privacy preservation and data utility, and synthetic data generated from them have lacked a systematic evaluation framework. This paper introduces RPU, a multi-dimensional, multi-level assessment framework for synthetic data tailored to public transport. RPU quantifies generation quality along three dimensions (representativeness, privacy, and utility) and at three levels of granularity (record, group, and population). It combines statistical similarity metrics, re-identification risk analysis, and downstream-task utility evaluation to benchmark 12 generative models, including CTGAN. The results show that no generator is universally optimal; CTGAN offers the most balanced privacy-utility trade-off. The study also refutes the misconception that "synthetic implies private," demonstrating that privacy guarantees cannot be assumed a priori. RPU thus provides an interpretable, reproducible benchmark for model selection and evidence-based policy formulation in transit analytics.

📝 Abstract
Synthetic data offer a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. Despite rapid progress in generative modeling, comprehensive evaluation has received limited attention, leaving it unclear how reliable, safe, and useful synthetic data truly are. Existing evaluations remain fragmented, typically limited to population-level representativeness or record-level privacy, without considering group-level variations or task-specific utility. To address this gap, we propose a Representativeness-Privacy-Utility (RPU) framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels (record, group, population). The framework integrates a consistent set of metrics to quantify similarity, disclosure risk, and practical usefulness, enabling transparent and balanced assessment of synthetic data quality. We apply the framework to benchmark twelve representative generation methods, spanning conventional statistical models, deep generative networks, and privacy-enhanced variants. Results show that synthetic data do not inherently guarantee privacy and that there is no "one-size-fits-all" model; the trade-off between privacy and representativeness/utility is evident. The conditional tabular generative adversarial network (CTGAN) provides the most balanced trade-off and is suggested for practical applications. The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reliability and safety of synthetic public transport data
Addressing fragmented assessments in representativeness, privacy and utility
Systematically comparing synthetic trip data generation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed RPU framework for synthetic trip data evaluation
Integrated metrics for similarity, risk, and usefulness assessment
Benchmarked twelve generation methods including CTGAN
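
To make the idea of scoring synthetic trips at several levels more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary above does not name the specific RPU metrics, so Jensen-Shannon divergence (for population- and group-level similarity) and an exact-match rate (as a crude record-level disclosure signal) are stand-ins chosen for this example. The record layout `(card_type, origin, destination, hour)`, the toy data, and all function names are hypothetical. The utility dimension is left out because it depends on a specific downstream task.

```python
import math
from collections import Counter

# Hypothetical toy records: (card_type, origin, destination, boarding_hour).
REAL = [("adult", "A", "B", 8), ("adult", "A", "C", 8),
        ("student", "B", "C", 9), ("student", "B", "A", 17)]
SYNTHETIC = [("adult", "A", "B", 8), ("adult", "A", "B", 9),
             ("student", "B", "C", 9), ("student", "C", "A", 17)]
HOURS = [8, 9, 17]  # categories over which hour distributions are compared


def distribution(values, categories):
    """Relative frequency of each category in `values`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs; 0 = identical, 1 = disjoint."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def population_similarity(real, synthetic):
    """Population level (assumed metric): divergence of boarding-hour distributions."""
    return js_divergence(distribution([r[3] for r in real], HOURS),
                         distribution([s[3] for s in synthetic], HOURS))


def group_similarity(real, synthetic):
    """Group level (assumed metric): the same divergence, computed per card type."""
    scores = {}
    for group in sorted({r[0] for r in real}):
        real_hours = [r[3] for r in real if r[0] == group]
        synth_hours = [s[3] for s in synthetic if s[0] == group]
        if synth_hours:  # skip groups the generator never produced
            scores[group] = js_divergence(distribution(real_hours, HOURS),
                                          distribution(synth_hours, HOURS))
    return scores


def exact_match_rate(real, synthetic):
    """Record level (assumed metric): share of synthetic records copying a real one."""
    real_set = set(real)
    return sum(s in real_set for s in synthetic) / len(synthetic)


if __name__ == "__main__":
    print("population JSD:", population_similarity(REAL, SYNTHETIC))
    print("group JSD:", group_similarity(REAL, SYNTHETIC))
    print("exact-match rate:", exact_match_rate(REAL, SYNTHETIC))
```

In this toy setup a generator can look acceptable at the population level while differing across passenger groups, and it can still copy real records verbatim, which is one way to see why "synthetic" does not by itself imply "private".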