What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks why repeated, adaptive use of a validation set for model selection produces so little overfitting in practice. Framing the question in terms of information bottlenecks, it empirically tests the "compressibility implies generalization" hypothesis in LLM-driven machine learning research agents. Using two agents, an explorer and a reproducer, the work shows across eight datasets (spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling) that successful strategies have low description length: performance is largely preserved under extreme compression of both outputs (the reproducer receives only an extremely short prompt plus the training data) and inputs (the explorer receives only 1-bit feedback). In contrast, when validation-set overfitting is deliberately induced, short prompts fail to reproduce the results. This shows that the hypothesis is falsifiable and supports the view that successful strategies occupy a low-complexity region of strategy space.
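For intuition about why low description length might limit overfitting, a textbook Occam-style bound is sketched below. It is a standard result (Hoeffding's inequality plus a union bound weighted by the Kraft inequality), not a formula taken from the paper. The symbols are assumptions introduced here: $h$ is a strategy with a prefix-free description of $|h|$ bits, $n$ is the validation-set size, $\hat{R}(h)$ and $R(h)$ are validation and true risk for a loss bounded in $[0,1]$, and $\delta$ is the failure probability.

```latex
% With probability at least 1 - \delta over an i.i.d. validation sample of size n,
% simultaneously for every strategy h with a prefix-free code of length |h| bits:
\[
  R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{|h|\ln 2 + \ln(1/\delta)}{2n}} .
\]
```

Under these assumptions, a strategy that fits in a very short prompt has a small $|h|$, so its validation score cannot be far above its true performance, however adaptively the strategy was found.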

📝 Abstract

Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In output compression, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh "reproducer agent" can reproduce its performance given only an extremely short prompt and the training data. In input compression, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.
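The input-compression bottleneck described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `one_bit_search`, the toy `propose` and `validation_score` functions, and all parameters are hypothetical stand-ins, and the "explorer" here is a random perturbation rather than an LLM agent. The sketch only shows the feedback protocol, in which the explorer never sees a validation score, just one bit per submission saying whether it beat the running best.

```python
import random
from typing import Callable, List, Optional, Tuple


def one_bit_search(
    propose: Callable[[Optional[float], random.Random], float],
    validation_score: Callable[[float], float],
    n_rounds: int,
    seed: int = 0,
) -> Tuple[Optional[float], List[int]]:
    """Run a search in which the explorer receives only 1 bit per submission."""
    rng = random.Random(seed)
    best: Optional[float] = None
    best_score = float("-inf")  # held by the evaluator, never shown to the explorer
    bits: List[int] = []
    for _ in range(n_rounds):
        # The explorer proposes using only the current best candidate.
        candidate = propose(best, rng)
        score = validation_score(candidate)
        improved = score > best_score
        bits.append(int(improved))  # the only feedback that leaves the evaluator
        if improved:
            best, best_score = candidate, score
    return best, bits


# Hypothetical toy problem: a "model" is one number, scored by closeness to 3.0.
def propose(best: Optional[float], rng: random.Random) -> float:
    base = 0.0 if best is None else best
    return base + rng.gauss(0.0, 1.0)


def validation_score(x: float) -> float:
    return -((x - 3.0) ** 2)


best, bits = one_bit_search(propose, validation_score, n_rounds=50)
print(f"best candidate: {best:.3f}, improvements: {sum(bits)} of {len(bits)}")
```

In this sketch the whole feedback transcript is `n_rounds` bits, so the validation set can influence the search through at most `2 ** n_rounds` distinct histories. That cap on leaked information is the sense in which the feedback is compressible.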

Problem

Research questions and friction points this paper is trying to address.

overfitting

generalization

compression

benchmark-driven ML

description length

Innovation

Methods, ideas, or system contributions that make the work stand out.

compression

generalization

overfitting