11 Jul 2024 | Tobias Golling, Lukas Heinrich, Michael Kagan, Samuel Klein, Matthew Leigh, Margarita Osadchy, and John Andrew Raine
This paper introduces masked particle modeling (MPM) as a self-supervised learning method for high energy physics (HEP) data. MPM is designed to learn permutation-invariant representations of unordered sets of particles, enabling generic, transferable, and reusable representations for HEP scientific data. The method masks a subset of the particles in a set and trains the model to recover them, with a pre-trained vector quantized variational autoencoder (VQ-VAE) supplying the discrete target identities. The approach is evaluated on high energy jet data from collider physics experiments, demonstrating its effectiveness in tasks such as jet classification and in fine-tuning to new classes and data domains.
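As a rough illustration of the tokenization step, the sketch below shows how a VQ-VAE-style tokenizer could map continuous per-particle features to discrete codebook indices. The class name ParticleTokenizer, the feature choice (pT, eta, phi), and all sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class ParticleTokenizer(nn.Module):
    """Toy VQ-VAE-style tokenizer: encode each particle, snap to the nearest codebook entry."""

    def __init__(self, n_features: int = 3, latent_dim: int = 16, n_codes: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.GELU(), nn.Linear(64, latent_dim)
        )
        self.codebook = nn.Embedding(n_codes, latent_dim)  # learned discrete latent vectors

    @torch.no_grad()
    def tokenize(self, particles: torch.Tensor) -> torch.Tensor:
        """particles: (batch, n_particles, n_features) -> integer tokens (batch, n_particles)."""
        z = self.encoder(particles)                       # continuous per-particle latents
        flat = z.flatten(0, 1)                            # (batch * n_particles, latent_dim)
        dist = torch.cdist(flat, self.codebook.weight)    # distance to every codebook entry
        return dist.argmin(dim=-1).view(z.shape[:-1])     # index of the nearest code


# Example: 8 jets with 30 particles each, described by e.g. (pT, eta, phi).
tokens = ParticleTokenizer().tokenize(torch.randn(8, 30, 3))
print(tokens.shape)  # torch.Size([8, 30])
```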
The MPM framework is inspired by masked language modeling (MLM) in natural language processing (NLP), where tokens are masked and the model is trained to predict them. In HEP the idea is adapted to unordered sets of particles, each represented by a vector of continuous features. A transformer encoder processes these inputs and predicts the masked particles, with the VQ-VAE assigning each particle a discrete token that serves as the prediction target during pre-training.
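To make the pre-training objective concrete, here is a hedged sketch of a masked-prediction step under the assumptions above: a learned mask embedding replaces a random subset of particles, a transformer encoder (used here without positional encodings) processes the set, and the model is trained to classify each masked slot into its VQ-VAE token. MPMBackbone, mpm_step, the masking fraction, and all hyperparameters are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MPMBackbone(nn.Module):
    def __init__(self, n_features: int = 3, d_model: int = 128, n_codes: int = 512):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))        # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)   # no positional encodings
        self.head = nn.Linear(d_model, n_codes)                     # logits over codebook indices

    def forward(self, particles: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = self.embed(particles)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)  # hide masked particles
        return self.head(self.encoder(x))                           # (batch, n_particles, n_codes)


def mpm_step(model: MPMBackbone, particles, tokens, mask_fraction: float = 0.3):
    """One pre-training step: predict the VQ-VAE token of every masked particle."""
    mask = torch.rand(particles.shape[:2]) < mask_fraction
    logits = model(particles, mask)
    return F.cross_entropy(logits[mask], tokens[mask])


model = MPMBackbone()
particles = torch.randn(8, 30, 3)
tokens = torch.randint(0, 512, (8, 30))   # targets from a frozen tokenizer (see sketch above)
mpm_step(model, particles, tokens).backward()
```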
The paper explores different tokenization schemes, including a VQ-VAE that discretizes the particle features and K-means clustering of the features for label generation. It also investigates the impact of imposing an order on the input particles and of adding positional embeddings, noting that a transformer encoder without positional information is naturally permutation invariant. The results show that using VQ-VAE tokens as targets outperforms both K-means tokens and direct regression of the continuous features.
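For comparison, the K-means alternative mentioned above amounts to clustering particle features offline and using the cluster index as the pre-training label. The sketch below assumes scikit-learn and three features per particle purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for a large collection of per-particle feature vectors, e.g. (pT, eta, phi),
# flattened over many jets; real inputs would come from the training sample.
particle_features = rng.normal(size=(100_000, 3))

# Fit 512 clusters once; the centroid index plays the role of the discrete token.
kmeans = KMeans(n_clusters=512, random_state=0).fit(particle_features)

# Tokenize one jet with 30 particles: each particle gets its nearest-centroid index.
jet = rng.normal(size=(30, 3))
tokens = kmeans.predict(jet)   # shape (30,), values in [0, 512)
```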
The MPM model is evaluated on downstream tasks including jet classification on new classes and new data domains. The results demonstrate that pre-trained models can be fine-tuned effectively on small labeled datasets and still achieve high performance. The model is also tested on out-of-domain data, indicating that pre-training directly on real data can help mitigate the domain shift between simulated and real experimental data.
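A minimal sketch of the fine-tuning stage, reusing the MPMBackbone from the earlier sketch: the token-prediction head is discarded, a jet-level classification head is attached on top of a permutation-invariant pooling of the encoder outputs, and the network is trained on a small labeled dataset. The pooling choice, class count, and optimizer settings are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JetClassifier(nn.Module):
    def __init__(self, backbone: MPMBackbone, d_model: int = 128, n_classes: int = 5):
        super().__init__()
        self.backbone = backbone                        # pre-trained embedding + encoder
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, particles: torch.Tensor) -> torch.Tensor:
        x = self.backbone.embed(particles)
        x = self.backbone.encoder(x)                    # reuse pre-trained weights
        return self.classifier(x.mean(dim=1))           # permutation-invariant mean pooling


clf = JetClassifier(model)                              # `model` from the pre-training sketch
opt = torch.optim.AdamW(clf.parameters(), lr=1e-4)

# One fine-tuning step on a (small) labeled batch of jets.
logits = clf(torch.randn(8, 30, 3))
loss = F.cross_entropy(logits, torch.randint(0, 5, (8,)))
loss.backward()
opt.step()
```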
The paper concludes that MPM provides a promising approach for developing foundation models in HEP, capable of learning generic representations that can be fine-tuned for a variety of downstream tasks. The method addresses the challenges of working with continuous and unordered data, and shows that self-supervised learning can be effective in HEP, potentially reducing the need for labeled data and improving model performance on new tasks and data domains.