Enhancing next token prediction based pre-training for jet foundation models

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited generation and classification performance of jet foundation models under simulation-free conditions. We propose an improved next-token prediction pretraining paradigm. Methodologically, (1) we design a hybrid input architecture that jointly encodes continuous physical feature vectors and discrete token IDs; and (2) we formulate a multi-task pretraining framework that concurrently optimizes next-token prediction, masked particle modeling, and generative learning objectives. The approach enables end-to-end jet representation learning without relying on simulated data. Empirically, it achieves state-of-the-art generative fidelity while substantially improving downstream classification accuracy—yielding an average gain of +3.2% across benchmarks. These results validate the efficacy and strong transferability of physics-informed pretraining strategies for high-energy physics foundation models.

Technology Category

Application Category

📝 Abstract
Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next token prediction, building on the initial work of OmniJet-$α$. Instead of tokenizing particles and subsequently only using the token-ID as the model input for both the generative and the classification task, we adopt a hybrid setup, which allows us to use continuous feature vectors as model input while only using token-IDs in the next token prediction target. Secondly, we explore a combined pre-training strategy that combines masked particle modeling and generative learning objectives. Taken together, these changes greatly improve the performance in downstream classification tasks without any loss in generative performance.
Problem

Research questions and friction points this paper is trying to address.

Improving next token prediction for jet foundation models
Using hybrid input with continuous features and token-IDs
Combining masked particle modeling with generative objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid setup uses continuous feature vectors as input
Combines masked particle modeling with generative learning objectives
Improves downstream classification without losing generative performance
🔎 Similar Papers
No similar papers found.
J
Joschka Birk
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
A
Anna Hallin
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
Gregor Kasieczka
Gregor Kasieczka
Universität Hamburg
Particle PhysicsMachine LearningAnomaly Detection
N
Nikol Madzharova
Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
I
Ian Pang
NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA
D
David Shih
NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854, USA