🤖 AI Summary
This work investigates the evolutionary stability and user-preference drift of generative models under self-consuming training loops subject to noise and adversarial data curation by malicious users. We propose the first theoretical framework characterizing how adversarial data poisoning degrades training robustness, and establish a quantitative relationship between preference drift and poisoning intensity. We design a gradient-based black-box attack algorithm with provable convergence guarantees, and model platform manipulation mechanisms via game-theoretic analysis and distributional shift theory. Experiments on real and synthetic data—including CIFAR-10 and Reddit text—demonstrate that our attack induces over 68% preference drift in target models within limited query budgets, significantly outperforming baselines. Moreover, the derived stability conditions are shown empirically to ensure robust training against such poisoning.
📝 Abstract
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival's model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
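The self-consuming loop with adversarial curation can be caricatured in a few lines of Python. This is purely our illustrative sketch, not the paper's setup: the "model" is a 1-D Gaussian refit on its own curated samples, honest curators prefer samples near +1, and a hypothetical 30% malicious minority prefers −1.

```python
import random

def run_loop(adv_fraction, generations=20, n=2000, seed=0):
    """Toy self-consuming retraining loop with noisy/adversarial curation.

    A 1-D Gaussian generative model is retrained each generation on its
    own synthetic samples, after those samples are curated by users.
    All targets and parameters here are hypothetical illustrations.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # initial model parameters

    def curate(x):
        # Each sample is judged by one curator: malicious curators prefer -1,
        # honest ones prefer +1. The sample is kept only if it lies closer
        # to that curator's preferred value.
        target = -1.0 if rng.random() < adv_fraction else 1.0
        return abs(x - target) < abs(x + target)

    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        kept = [x for x in synthetic if curate(x)]
        mean = sum(kept) / len(kept)  # "retrain" = refit the mean on curated data
    return mean

# With a 30% adversarial minority, the refit mean still drifts toward the
# honest users' preference (positive), but the poisoning biases the drift;
# a malicious majority would pull it the other way.
print(run_loop(adv_fraction=0.3))
```

Flipping `adv_fraction` above 0.5 reverses the drift direction, which is the intuition behind the competitive attack scenario: a rival platform's budget buys a larger malicious fraction and thus a larger misalignment from actual user preferences.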