Aspen Open Jets: unlocking LHC data for foundation models in particle physics

📅 2024-12-13
🏛️ Machine Learning: Science and Technology
📈 Citations: 10
Influential: 0
🤖 AI Summary
Foundation models for high-energy physics (HEP) have largely been pre-trained on simulated datasets, which can differ from actual collider observations. Method: The paper introduces AspenOpenJets (AOJs), an open, ML-ready jet dataset derived from CMS 2016 Open Data and comprising approximately 178 million high-pT jets, and pre-trains the OmniJet-α foundation model on it. Contribution/Results: Pre-training on real proton-proton collision data improves performance on generative tasks with significant domain shift, namely generating boosted top and QCD jets from the simulated JetClass dataset. The derived AOJs dataset is released for further public use.

📝 Abstract
Foundation models are deep learning models pre-trained on large amounts of data which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets (AOJs) dataset, consisting of approximately 178 M high pT jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AOJs improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training of a jet-based foundation model on actual proton–proton collision data, we provide the ML-ready derived AOJs dataset for further public use.
Problem

Research questions and friction points this paper is trying to address.

Pre-training foundation models using CMS LHC data for particle physics applications
Improving generative performance on jet generation tasks with significant domain shift
Providing a publicly available, ML-ready jet dataset derived from proton-proton collision data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-training of the OmniJet-α foundation model on CMS Open Data
AspenOpenJets dataset with approximately 178M high-pT jets
Improved generative performance on domain-shifted tasks
Oz Amram
Fermi National Accelerator Laboratory, Batavia, IL 60510, USA
Luca Anzalone
Department of Physics and Astronomy (DIFA), University of Bologna, 40127 Bologna, Italy
Joschka Birk
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
D. Faroughy
NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA
Anna Hallin
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
G. Kasieczka
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
Michael Krämer
Professor of Theoretical Physics, RWTH Aachen University
Theoretical particle and astroparticle physics, machine learning, philosophy of science
Ian Pang
NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA
H. Reyes-González
Institut für Theoretische Teilchenphysik und Kosmologie, RWTH Aachen University, 52074 Aachen, Germany
David Shih
NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA