Synthetic Data for Portfolios: A Throw of the Dice Will Never Abolish Chance

📅 2025-01-07

🤖 AI Summary

Problem: Generative models exhibit fundamental limitations in financial portfolio construction and risk management, in particular the statistical distortion that arises when far more synthetic data is generated than the small initial sample supports, and the neglect of portfolio characteristics such as long-short structure. Method: We propose an integrated multivariate return synthesis framework that jointly targets statistical validity, adherence to stylized financial facts, and portfolio utility; further, we introduce a task-oriented identifiability evaluation paradigm based on “regurgitative training,” intended to assess whether a generative model is fit for a given portfolio construction task. Contribution/Results: Validated on a large universe of U.S. equities, our approach meets conventional evaluation metrics and market stylized facts while exposing models that score well on those metrics yet fail under long-short strategies. It offers theoretical results and practical guidelines for the trustworthy use of generative AI in financial decision-making.

📝 Abstract

Simulation methods have always been instrumental in finance, and data-driven methods with minimal model specification, commonly referred to as generative models, have attracted increasing attention, especially after the success of deep learning in a broad range of fields. However, the adoption of these models in financial applications has not matched the growing interest, probably due to the unique complexities and challenges of financial markets. This paper contributes to a deeper understanding of the limitations of generative models, particularly in portfolio and risk management. To this end, we begin by presenting theoretical results on the importance of the initial sample size, and point out the potential pitfalls of generating far more data than was originally available. We then highlight that model development cannot be separated from the intended use by touching on a paradox: standard generative models inherently care less about what matters for constructing portfolios (in particular long-short ones). Based on these findings, we propose a pipeline for the generation of multivariate returns that meets conventional evaluation standards on a large universe of US equities while complying with the stylized facts observed in asset returns and avoiding the pitfalls we previously identified. Moreover, we stress the need for more accurate evaluation methods, and suggest, through an example of mean-reversion strategies, a method designed to identify poor models for a given application. The method is based on regurgitative training, i.e. retraining the model using the data it has itself generated, which is commonly referred to in statistics as identifiability.
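The point about initial sample size can be illustrated with a toy experiment. The sketch below is not the paper's theoretical result or its pipeline: it assumes a hypothetical one-dimensional Gaussian "generative model" fitted to `n_real` observations, and the function name `synthetic_mean_rmse` and all parameter values are chosen purely for illustration. It measures how far a mean estimated from synthetic data lies from the true mean as the synthetic sample grows.

```python
import numpy as np

def synthetic_mean_rmse(n_real, n_synth, n_trials=2000, seed=0):
    """RMSE (against the true mean) of a mean estimated from synthetic data."""
    rng = np.random.default_rng(seed)
    true_mu, true_sigma = 0.0, 1.0  # hypothetical ground truth, for illustration only
    errors = np.empty(n_trials)
    for t in range(n_trials):
        # Small "real" sample, the only information the model ever sees.
        real = rng.normal(true_mu, true_sigma, n_real)
        # Toy generative model: a Gaussian fitted to the real sample.
        mu_hat, sigma_hat = real.mean(), real.std(ddof=1)
        # Generate n_synth synthetic points and estimate the mean from them.
        synth = rng.normal(mu_hat, sigma_hat, n_synth)
        errors[t] = synth.mean() - true_mu
    return float(np.sqrt(np.mean(errors ** 2)))

n_real = 50
rmse_equal = synthetic_mean_rmse(n_real, n_synth=50)    # as much synthetic as real data
rmse_many = synthetic_mean_rmse(n_real, n_synth=5000)   # 100x more synthetic data
floor = 1.0 / np.sqrt(n_real)  # error of the real-sample mean itself

print(f"RMSE with 50 synthetic points:   {rmse_equal:.4f}")
print(f"RMSE with 5000 synthetic points: {rmse_many:.4f}")
print(f"Floor set by the real sample:    {floor:.4f}")
```

Under these Gaussian assumptions the squared error is σ²/n_real + σ²/n_synth, so enlarging the synthetic sample only removes the second term. The first term, fixed by the original sample, remains however much data is generated.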

Problem

Research questions and friction points this paper is trying to address.

Understanding limitations of generative models in finance

Addressing pitfalls in synthetic data generation for portfolios

Proposing accurate evaluation methods for financial models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative models for multivariate financial returns

Pipeline compliant with stylized asset return facts

Regurgitative training for model evaluation
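As a rough illustration of regurgitative training, the sketch below repeatedly refits a model on its own output and tracks how a portfolio-relevant quantity drifts. It is a minimal toy under stated assumptions, not the paper's procedure: the "generative model" is a plain multivariate Gaussian, the tracked quantity is the unconstrained (long-short) minimum-variance weight vector rather than a mean-reversion strategy, and every function name here is hypothetical.

```python
import numpy as np

def fit_gaussian(returns):
    """Toy 'training': estimate the mean vector and covariance matrix."""
    return returns.mean(axis=0), np.cov(returns, rowvar=False)

def sample_gaussian(params, n_obs, rng):
    """Toy 'generation': draw synthetic returns from the fitted Gaussian."""
    mu, cov = params
    return rng.multivariate_normal(mu, cov, size=n_obs)

def min_variance_weights(cov):
    """Unconstrained minimum-variance weights; entries may be negative (short)."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)
    return raw / raw.sum()

def regurgitative_drift(returns, n_rounds, rng):
    """Refit the model on its own samples and record the weight drift per round."""
    params = fit_gaussian(returns)
    w_ref = min_variance_weights(params[1])  # weights implied by the first fit
    drifts = []
    for _ in range(n_rounds):
        synthetic = sample_gaussian(params, len(returns), rng)
        params = fit_gaussian(synthetic)     # retrain on generated data
        w = min_variance_weights(params[1])
        drifts.append(float(np.abs(w - w_ref).sum()))  # L1 distance to reference
    return drifts

# Hypothetical data: 250 days of i.i.d. returns on 5 assets (illustration only).
rng = np.random.default_rng(0)
real_returns = rng.normal(0.0, 0.01, size=(250, 5))
drifts = regurgitative_drift(real_returns, n_rounds=5, rng=rng)
print(drifts)
```

The idea is that a model suited to the task should approximately recover, after retraining on its own samples, the quantity the application depends on. A large or growing drift would flag the model as a poor fit for that application, even if it passes generic distributional checks.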
