Can Generalist Agents Automate Data Curation?

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the manual, iterative effort that data curation demands by introducing Curation-Bench, an agent-centric benchmark for generalist coding agents. The benchmark fixes the model, training recipe, and evaluation suite, and gives agents command-line access to inspect datasets, implement curation policies, submit them to the fixed training/evaluation pipeline, and revise. It also studies scaffolds that require each iteration to cite, instantiate, and adapt a prior method, which shifts agents from tuning local policy variants toward method-guided exploration. In a vision-language instruction-tuning setting, out-of-the-box agents reach strong published data-selection baselines within ten iterations; with scaffolding, the agent autonomously composes a policy that outperforms those baselines at one-tenth their data budget, without human design input.

📝 Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
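The loop the abstract describes (propose a policy, submit it to a fixed pipeline, record the score, revise) can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from the Curation-Bench code: the `quality` field, the `fixed_pipeline` stand-in (which averages a score instead of training a model), the two example policies, and the `cited_method` check that loosely mimics a scaffold requiring each iteration to name a prior method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record shape: {"text": str, "quality": float}.
Example = Dict[str, object]
# A policy maps (data pool, budget) to the indices it selects.
Policy = Callable[[Sequence[Example], int], List[int]]


@dataclass
class Submission:
    policy_name: str
    cited_method: str  # assumed scaffold field: the prior method being adapted
    selected: List[int]
    score: float


def first_n(pool: Sequence[Example], budget: int) -> List[int]:
    """Naive policy: keep the first `budget` examples."""
    return list(range(min(budget, len(pool))))


def top_quality(pool: Sequence[Example], budget: int) -> List[int]:
    """Keep the `budget` examples with the highest (hypothetical) quality."""
    order = sorted(range(len(pool)), key=lambda i: pool[i]["quality"], reverse=True)
    return order[:budget]


def fixed_pipeline(pool: Sequence[Example], selected: List[int]) -> float:
    """Stand-in for a fixed train/eval pipeline: mean quality of the subset."""
    if not selected:
        return 0.0
    return sum(pool[i]["quality"] for i in selected) / len(selected)


def run_loop(
    pool: Sequence[Example],
    proposals: Sequence[Tuple[str, str, Policy]],
    budget: int,
    max_iters: int = 10,
) -> List[Submission]:
    """Evaluate up to `max_iters` proposals under one shared data budget."""
    history: List[Submission] = []
    for name, cited, policy in proposals[:max_iters]:
        if not cited:
            # Toy scaffold rule: every iteration must name a prior method.
            raise ValueError(f"policy {name!r} does not cite a prior method")
        selected = policy(pool, budget)
        if len(selected) > budget:
            raise ValueError(f"policy {name!r} exceeds the data budget")
        history.append(Submission(name, cited, selected, fixed_pipeline(pool, selected)))
    return history


if __name__ == "__main__":
    qualities = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.5, 0.0]
    pool = [{"text": f"ex{i}", "quality": q} for i, q in enumerate(qualities)]
    proposals = [
        ("first_n", "prefix baseline", first_n),
        ("top_quality", "quality-score filtering", top_quality),
    ]
    history = run_loop(pool, proposals, budget=2)
    best = max(history, key=lambda s: s.score)
    print(best.policy_name, best.selected)
```

In this sketch the model, budget, and scoring function stay fixed while only the selection policy changes, which is the property that makes scores comparable across iterations. A real instantiation would replace `fixed_pipeline` with actual training and evaluation.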

Problem

Research questions and friction points this paper is trying to address.

data curation

generalist agents

automation

data policy

agent-based automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

generalist agents

data curation

Curation-Bench