🤖 AI Summary
This work addresses the lack of a principled criterion for choosing synthetic data generators when pretraining time-series foundation models: under identical training budgets, the best and worst generators differ by up to $2\times$ in forecasting error, and generator rankings are not stable across model architectures. Rather than treating this as a single-generator selection problem, the study reframes it as a corpus composition problem, building pretraining corpora from an equal-weight mixture of all generators and then composing that mixture with real data. Experiments across 11 generator families, with Chronos-T5-Mini and Moirai-Small trained from scratch, show that the equal-weight mixture matches or beats the best individual generator on both architectures, and that composing it with real data yields the strongest pretraining corpora overall. The authors conclude that composition choices should be validated per model family rather than assumed to transfer.
📝 Abstract
Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best and worst generators produce up to a $2\times$ gap in forecasting error, yet the field has no principled way to make this choice. The problem is compounded by the fact that generator rankings are not stable across architectures: across 11 generator families evaluated on Chronos-T5-Mini and Moirai-Small trained from scratch, we find that which generators are useful depends on the model architecture. Rather than solving the generator selection problem, we sidestep it: a simple equal-weight mixture of all generators matches or beats the best individual generator for both architectures, and composing this mixture with real data yields the strongest pretraining corpora overall. Synthetic pretraining is therefore a corpus composition problem, not a generator selection problem, and composition choices should be validated per model family rather than assumed to transfer.
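As an illustration of what an equal-weight mixture could look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes that "equal weight" means each series in the corpus is drawn from a generator chosen uniformly at random; the abstract does not say whether weights apply per series, per token, or otherwise. The three toy generators and the function name `sample_equal_mixture` are hypothetical stand-ins for the paper's 11 generator families.

```python
import numpy as np

def sample_equal_mixture(generators, n_series, length, seed=0):
    """Draw n_series series, choosing a generator uniformly at random for each."""
    rng = np.random.default_rng(seed)
    names = list(generators)
    # Equal weight (assumption): each generator has probability 1 / len(names).
    picks = rng.integers(0, len(names), size=n_series)
    series = np.stack([generators[names[i]](rng, length) for i in picks])
    labels = [names[i] for i in picks]
    return series, labels

# Hypothetical toy generators, used only to make the sketch runnable.
def random_walk(rng, length):
    return np.cumsum(rng.standard_normal(length))

def noisy_sine(rng, length):
    period = rng.uniform(8, 64)
    t = np.arange(length)
    return np.sin(2 * np.pi * t / period) + 0.1 * rng.standard_normal(length)

def white_noise(rng, length):
    return rng.standard_normal(length)

TOY_GENERATORS = {
    "random_walk": random_walk,
    "noisy_sine": noisy_sine,
    "white_noise": white_noise,
}

corpus, labels = sample_equal_mixture(TOY_GENERATORS, n_series=300, length=128)
```

Here `corpus` is an array with one synthetic series per row and `labels` records which generator produced each row, which makes it easy to check the realized mixture proportions. How the mixture is combined with real data is not specified in the abstract, so that step is left out of the sketch.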