Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies "model collapse", a phenomenon in which models trained on unfiltered text embeddings (TEs) degenerate to predicting a single class, yielding spuriously high accuracy and false correlations in downstream tasks. To quantify collapse severity, we propose dedicated metrics and demonstrate that TE quality serves as an effective proxy for data curation. Through controlled experiments comparing model performance on original tabular data versus its TE representations, we show that uncurated TEs consistently induce collapse, severely impairing out-of-distribution generalization. Our study is the first to systematically examine the impact of TE quality on learning robustness; it establishes a reproducible evaluation framework and provides principled data-filtering criteria for embedding-driven modeling, bridging a key gap between embedding usage and reliable machine-learning practice.

📝 Abstract
Training models on uncurated Text Embeddings (TEs) derived from raw tabular data can lead to a severe failure mode known as model collapse, where predictions converge to a single class regardless of input. By comparing models trained with identical hyper-parameter configurations on both raw tabular data and their TE-derived counterparts, we find that collapse is a consistent failure mode in the latter setting. We introduce a set of metrics that capture the extent of model collapse, offering a new perspective on TE quality as a proxy for data curation. Our results reveal that TEs alone do not effectively function as a curation layer, and that their quality significantly influences downstream learning. More insidiously, we observe that the presence of model collapse can yield artificially inflated and spurious Accuracy-on-the-Line correlations. These findings highlight the need for more nuanced curation and evaluation of embedding-based representations, particularly in out-of-distribution settings.
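The paper's collapse metrics are not reproduced on this page; as a minimal illustration only (a hypothetical score, not the authors' definition), one simple way to quantify "predictions converge to a single class" is to measure how concentrated predictions are on the modal class, rescaled so 0 means a uniform spread and 1 means total collapse:

```python
import numpy as np

def collapse_score(preds: np.ndarray, n_classes: int) -> float:
    """Fraction of predictions in the modal class, rescaled so that
    1/n_classes (uniform spread) maps to 0 and 1.0 (total collapse) maps to 1."""
    counts = np.bincount(preds, minlength=n_classes)
    modal_frac = counts.max() / counts.sum()
    return (modal_frac - 1.0 / n_classes) / (1.0 - 1.0 / n_classes)

# Collapsed model: every prediction is class 2.
collapsed = np.full(1000, 2)
print(collapse_score(collapsed, n_classes=4))  # → 1.0

# Healthy model: predictions spread roughly evenly across classes,
# so the score stays near 0.
healthy = np.random.default_rng(0).integers(0, 4, 1000)
print(collapse_score(healthy, n_classes=4))
```

Any concentration measure (e.g. normalized prediction entropy) would serve the same illustrative purpose; the modal-class fraction is simply the easiest to read off a confusion matrix.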
Problem

Research questions and friction points this paper is trying to address.

Model collapse occurs when training on uncurated text embeddings
Uncurated text embeddings lead to spurious performance prediction
Text embedding quality significantly impacts downstream learning outcomes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metrics to quantify model collapse extent
TE quality as data curation proxy
Highlight spurious Accuracy-on-the-Line correlation
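The last point can be illustrated with a toy example (synthetic labels, not the paper's data): a collapsed model's accuracy equals the majority-class prevalence on any split, so its in-distribution and out-of-distribution accuracies appear high and correlated even though the model has no real skill, which is exactly how collapse can fake an Accuracy-on-the-Line trend:

```python
import numpy as np

rng = np.random.default_rng(0)
# Class-imbalanced labels: class 1 dominates both splits.
y_id = (rng.random(2000) < 0.80).astype(int)   # ~80% class 1 in-distribution
y_ood = (rng.random(2000) < 0.70).astype(int)  # ~70% class 1 out-of-distribution

# A collapsed model always predicts the majority class.
preds = np.ones(2000, dtype=int)

acc_id = (preds == y_id).mean()
acc_ood = (preds == y_ood).mean()
# Both accuracies track class prevalence (~0.80 and ~0.70): they look
# high and well-aligned, yet reflect label imbalance, not generalization.
print(acc_id, acc_ood)
```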