🤖 AI Summary
This work addresses quark/gluon jet discrimination in high-energy physics through a systematic evaluation of Vision Transformers (ViTs) applied end-to-end to multi-channel calorimeter images built from electromagnetic (ECAL) and hadronic (HCAL) energy deposits and reconstructed tracks. It benchmarks pure ViT models and ViT-CNN hybrids (ViT+MaxViT, ViT+ConvNeXt), which can capture long-range correlations within jet substructure. On simulated 2012 CMS Open Data, which includes realistic detector response and pileup, the ViT-based models, notably the two hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy. The study provides the first systematic framework and performance baselines for ViT-based jet classification on calorimeter images from public collider data, together with a structured multi-channel jet image dataset for further deep learning research.
📝 Abstract
Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
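The image construction described in the abstract can be illustrated with a minimal sketch. It bins per-hit energies into a three-channel jet image and then splits that image into the flattened patches a ViT would embed as tokens. The function names, the 8x8 grid, the 0.4 half-width window, the patch size, and the channel ordering (0 = ECAL, 1 = HCAL, 2 = tracks) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def build_jet_image(eta, phi, energy, channel, n_pix=8, half_width=0.4, n_channels=3):
    """Bin hits into an (n_channels, n_pix, n_pix) image centred on the jet axis.

    eta, phi : hit coordinates relative to the jet axis (assumed already centred)
    energy   : per-hit energy (or track pT) used as the pixel weight
    channel  : integer channel index per hit (assumed 0=ECAL, 1=HCAL, 2=tracks)
    """
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    image = np.zeros((n_channels, n_pix, n_pix), dtype=np.float32)
    for c in range(n_channels):
        mask = channel == c
        # Hits outside the window are dropped by histogram2d.
        hist, _, _ = np.histogram2d(
            eta[mask], phi[mask], bins=(edges, edges), weights=energy[mask]
        )
        image[c] = hist
    return image

def to_patches(image, patch=4):
    """Split a (C, H, W) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, C * patch * patch), the usual
    input to a ViT's linear patch-embedding layer.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * patch * patch)

# Toy example: three hits, one per channel.
eta = np.array([0.0, 0.1, -0.3])
phi = np.array([0.0, -0.2, 0.3])
energy = np.array([1.0, 2.0, 3.0])
channel = np.array([0, 1, 2])

image = build_jet_image(eta, phi, energy, channel)
patches = to_patches(image, patch=4)
```

Each row of `patches` corresponds to one spatial region across all three channels, so self-attention over the rows can relate energy deposits in distant parts of the jet, which is the long-range correlation the abstract refers to.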