🤖 AI Summary
Foundation models in high-energy physics (HEP) are typically pre-trained on simulated datasets rather than real experimental data, and simulation exhibits domain gaps with actual collider observations.
Method: We introduce AspenOpenJets (AOJs), an open, ML-ready jet dataset derived from CMS 2016 Open Data and comprising approximately 178 million high-pₜ jets, and use it to pre-train the OmniJet-α foundation model directly on real proton–proton collision data rather than on simulation.
Contribution/Results: Pre-training on AOJs improves performance on generative tasks with significant domain shift, namely generating boosted top and QCD jets from the simulated JetClass dataset. The work demonstrates that real collision data can be used to pre-train a jet-based foundation model, and it releases the ML-ready AOJs dataset for further public use.
📝 Abstract
Foundation models are deep learning models that are pre-trained on large amounts of data and can generalize to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets (AOJs) dataset, consisting of approximately 178 million high-pT jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AOJs improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton–proton collision data, we provide the ML-ready derived AOJs dataset for further public use.