pith. machine review for the scientific record.

arxiv: 2604.21035 · v1 · submitted 2026-04-22 · ✦ hep-ph · hep-ex

Recognition: unknown

Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

Ambre Visive, Clara Nellist, Polina Moskvitina, Roberto Ruiz de Austri, Sascha Caron

Pith reviewed 2026-05-09 23:21 UTC · model grok-4.3

classification ✦ hep-ph hep-ex
keywords anomaly detection · Large Hadron Collider · masked token prediction · vector-quantized autoencoders · four-top production · supersymmetry · transformer architecture · Standard Model background

The pith

Masked-token prediction trained solely on background events detects anomalous collider signatures by scoring deviations from learned Standard Model patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies masked-token prediction from language models to anomaly detection in high-energy physics. It trains a lightweight encoder exclusively on tokenized sequences of Standard Model background events to capture their structure. At inference time, sequences that produce high errors in predicting the masked tokens receive elevated anomaly scores. The method performs well on the four-top quark signature, which closely resembles background, and shows improved results when using vector-quantized variational autoencoder tokenization instead of lookup tables. Once trained on background, the model transfers to multiple beyond-Standard-Model searches without retraining.

Core claim

By representing collider events as sequences of tokens and training an encoder to predict masked tokens from background data alone, the method learns the patterns of Standard Model physics. Deviations in these predictions then serve as anomaly scores for potential new physics signals, without any signal-specific training. Evaluation on four-top quark production and supersymmetric gluino pair production shows effective detection, particularly with deep-learned tokenization.

What carries the argument

Masked-token prediction on tokenized event sequences, where the model learns to reconstruct masked tokens from background context and uses prediction error as the anomaly score.
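
To make the scoring mechanism concrete, here is a minimal sketch of background-only masked-token scoring, assuming events arrive as fixed-length integer token sequences. The vocabulary size, sequence length, and encoder dimensions are illustrative placeholders, not the authors' configuration, and the masked-prediction training loop on background events is omitted.

```python
# Minimal sketch of masked-token anomaly scoring; sizes are hypothetical.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, MASK_ID = 512, 16, 512  # MASK_ID is one extra id beyond the vocabulary

class MaskedTokenEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, d_model)           # +1 for the mask token
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, VOCAB)                   # logits over real tokens

    def forward(self, tokens):                                  # tokens: (batch, SEQ_LEN)
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h)                                     # (batch, SEQ_LEN, VOCAB)

@torch.no_grad()
def anomaly_score(model, tokens):
    """Mask each position in turn and sum the cross-entropy of the true token.

    After training the masked-prediction objective on Standard Model background
    only, background-like events stay predictable (low score) while events that
    deviate from the learned structure score high."""
    model.eval()
    score = torch.zeros(tokens.shape[0])
    for i in range(SEQ_LEN):
        masked = tokens.clone()
        masked[:, i] = MASK_ID
        logits = model(masked)[:, i, :]
        score += nn.functional.cross_entropy(logits, tokens[:, i], reduction="none")
    return score

# usage on random stand-in sequences:
# scores = anomaly_score(MaskedTokenEncoder(), torch.randint(0, VOCAB, (8, SEQ_LEN)))
```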

If this is right

  • The model transfers across different beyond-Standard-Model searches after a single background-only training run.
  • Vector-quantized variational autoencoder tokenization improves detection performance over lookup table tokenization (the two schemes are contrasted in the sketch after this list).
  • Strong results on the four-top signature demonstrate sensitivity to subtle deviations that resemble background.
  • The approach supports scalable, model-independent anomaly detection at reduced computational cost.
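
The tokenization contrast above can be sketched as follows: a look-up table quantizes each continuous feature independently into fixed bins, while a VQ-VAE-style tokenizer maps whole feature vectors to their nearest learned codebook entries, so a single token can encode joint structure across features. Bin ranges, codebook size, and feature layout here are assumptions, and codebook training is omitted.

```python
# Sketch of the two tokenization schemes compared in the paper; all sizes are stand-ins.
import numpy as np

def lut_tokenize(x, n_bins=64, lo=0.0, hi=1.0):
    """Look-up table scheme: quantize each continuous feature independently
    into uniform bins, so tokens ignore correlations between features."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]   # interior bin edges
    return np.digitize(x, edges)                    # (n_events, n_features) ints

def vq_tokenize(x, codebook):
    """VQ-VAE-style scheme (quantization step only): map each feature vector
    to the index of the nearest learned codebook vector, so one token can
    carry joint information across features."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)                         # (n_events,) ints

# usage with random stand-ins for event features and a trained codebook:
rng = np.random.default_rng(0)
events = rng.uniform(size=(100, 4))                 # 4 hypothetical kinematic features
codebook = rng.uniform(size=(32, 4))                # 32-entry codebook (normally learned)
print(lut_tokenize(events).shape, vq_tokenize(events, codebook).shape)
```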

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The importance of tokenization choice indicates that data representation learning is a key factor when adapting sequence models to physics events.
  • This method could extend to other sequential or high-dimensional datasets in particle physics where explicit feature engineering is costly.
  • The transferability across searches suggests potential for unified anomaly pipelines that scan large datasets for unexpected signals.

Load-bearing premise

That sequences of tokenized background events capture enough of the Standard Model physics structure for beyond-Standard-Model deviations to produce reliably higher anomaly scores without signal-specific training.

What would settle it

A controlled test comparing the anomaly-score distributions of known beyond-Standard-Model events and Standard Model background events: statistically indistinguishable scores would refute the load-bearing premise, while a clear separation would support it.
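
As a hedged sketch of what such a test could look like in practice, compare the two score distributions with a two-sample test. The scores below are synthetic stand-ins, not results from the paper.

```python
# Two-sample comparison of anomaly-score distributions; scores are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
bg_scores = rng.normal(0.0, 1.0, 10_000)   # stand-in for SM background anomaly scores
bsm_scores = rng.normal(0.4, 1.0, 1_000)   # stand-in for known-BSM anomaly scores

stat, p = ks_2samp(bg_scores, bsm_scores)
print(f"KS statistic = {stat:.3f}, p-value = {p:.2e}")
# A large p-value (distributions indistinguishable) would refute the premise;
# a small p-value with clearly separated distributions would support it.
```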

Figures

Figures reproduced from arXiv: 2604.21035 by Ambre Visive, Clara Nellist, Polina Moskvitina, Roberto Ruiz de Austri, Sascha Caron.

Figure 1: Visualisation of the procedure for a random event where the token of index 5 (397) is masked.
Figure 2: Illustrative distribution of anomaly scores in an ideal scenario.
Figure 3: ROC curves for the downstream models evaluated with each tokenization strategy, where LUT denotes the look-up table tokenization, shown for the tt̄tt̄ scenario on the left and the g̃g̃ scenario on the right. For the LUT scheme, the ROC curves obtained in the four-top benchmark exhibit non-monotonic behaviour, indicating partially overlapping score distributions. While this feature becomes less pronounced…
Figure 4: ROC curves for the proposed method (labelled as 'MaskedToken+VQVAE' in the legend) and other established unsupervised methods from Ref. [20], shown for the tt̄tt̄ scenario on the left and the g̃g̃ scenario on the right.
original abstract

Anomaly detection in High Energy Physics requires identifying rare signals against overwhelming backgrounds, without prior knowledge of the signal. We present the first application of masked-token prediction, a technique from Large Language Models, to this problem. A lightweight encoder architecture trained solely on background events captures the structure of Standard Model (SM) physics; at inference, sequences deviating from this learned structure are flagged as anomalous. We evaluate the approach on searches for four-top-quark production and supersymmetric gluino pair production, both featuring top-rich final states with substantial missing transverse energy, covering SM and beyond the Standard Model (BSM) scenarios. Strong performance on the four-top signature, which closely resembles background, demonstrates the method's sensitivity to subtle deviations. We further show that the tokenization strategy significantly impacts performance: deep-learned tokenization via vector-quantized variational autoencoders (VQ-VAE) outperforms look-up table tokenization. Comparison with established anomaly detection baselines confirms robustness. These results highlight the potential of token-based collider data representations combined with transformer architectures for new-physics discovery. Once trained on SM background, the model transfers across different BSM searches, enabling scalable, model-independent anomaly detection at reduced computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents the first application of masked-token prediction from large language models to anomaly detection in LHC collider data. A lightweight encoder is trained exclusively on Standard Model background events using tokenized event sequences; at inference, deviations from the learned structure yield anomaly scores. The approach is tested on four-top-quark production and supersymmetric gluino-pair production (both top-rich final states with substantial MET), with claims of strong performance on the four-top channel despite its background-like kinematics, superior results when using VQ-VAE tokenization versus lookup-table tokenization, transferability across BSM searches, and robustness relative to established anomaly-detection baselines.

Significance. If the quantitative results and validation tests support the claims, the work could introduce a scalable, model-independent anomaly-detection paradigm that repurposes transformer masked-prediction techniques for tokenized collider data. This would offer a route to efficient, signal-agnostic new-physics searches that avoid per-signal retraining and could lower computational overhead once the background model is learned.

major comments (2)
  1. Abstract: the abstract asserts 'strong performance on the four-top signature' and that 'comparison with established anomaly detection baselines confirms robustness,' yet supplies no quantitative metrics (AUC, significance, error bars), baseline values, or details on data selection, training procedure, or evaluation protocol. These omissions are load-bearing for the central claims of sensitivity to subtle deviations and cross-BSM transferability.
  2. Tokenization strategy: the paper states that VQ-VAE tokenization 'significantly impacts performance' and outperforms lookup tables, but provides no explicit test or analysis showing that the learned discrete codes preserve the continuous kinematic correlations (jet pT, MET, angular separations) required for the model to capture SM structure. Without such checks, the assumption that reconstruction errors on masked background sequences reliably flag subtle BSM deviations remains unverified and central to the method's validity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results and validation.

point-by-point responses
  1. Referee: Abstract: the abstract asserts 'strong performance on the four-top signature' and that 'comparison with established anomaly detection baselines confirms robustness,' yet supplies no quantitative metrics (AUC, significance, error bars), baseline values, or details on data selection, training procedure, or evaluation protocol. These omissions are load-bearing for the central claims of sensitivity to subtle deviations and cross-BSM transferability.

    Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will update the abstract to report key AUC values for the four-top and gluino-pair channels, the improvement over baselines, and brief details of the data selection, training, and evaluation protocol, while keeping the abstract concise. revision: yes

  2. Referee: Tokenization strategy: the paper states that VQ-VAE tokenization 'significantly impacts performance' and outperforms lookup tables, but provides no explicit test or analysis showing that the learned discrete codes preserve the continuous kinematic correlations (jet pT, MET, angular separations) required for the model to capture SM structure. Without such checks, the assumption that reconstruction errors on masked background sequences reliably flag subtle BSM deviations remains unverified and central to the method's validity.

    Authors: We acknowledge that an explicit verification of correlation preservation would strengthen the validation of the tokenization step. Although comparative performance results are already presented, we will add a new subsection and accompanying figure in the revision that directly compares kinematic distributions and correlation matrices (jet pT, MET, angular separations) between the original continuous variables and the VQ-VAE tokenized representations to confirm that the essential SM structure is retained. revision: yes
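
A minimal version of the promised check might look like the following, where x_original holds the continuous kinematic features (jet pT, MET, angular separations) as columns and x_roundtrip holds the same events after a tokenize-then-decode pass through the VQ-VAE. Both names and the decode step are assumptions for illustration, not the authors' code.

```python
# Sketch of a correlation-preservation check for the VQ-VAE tokenization.
import numpy as np

def correlation_gap(x_original, x_roundtrip):
    """Largest absolute difference between the feature correlation matrices of
    the original events and the tokenize->decode round trip; values near zero
    mean the discrete codes preserved the kinematic correlations."""
    c_orig = np.corrcoef(x_original, rowvar=False)   # features as columns
    c_rt = np.corrcoef(x_roundtrip, rowvar=False)
    return np.abs(c_orig - c_rt).max()

# usage: gap = correlation_gap(events, vqvae.decode(vqvae.tokenize(events)))
```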

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper applies masked-token prediction (an established LLM technique) to HEP anomaly detection by training a lightweight encoder solely on background events to learn SM structure, then scoring anomalies via deviation from predicted tokens at inference. This is a direct, non-circular transfer of the method: the anomaly score is the reconstruction error under masking, not a quantity defined in terms of itself or a fitted parameter renamed as a prediction. Tokenization (VQ-VAE vs. lookup table) is compared as an independent design choice with external baselines, and transfer across BSM searches follows from the background-only training without self-referential loops or load-bearing self-citations. No self-definitional steps, ansatz smuggling, or renaming of known results appear; the chain remains self-contained against standard anomaly detection benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of machine learning applied to physics data plus domain-specific modeling of SM background as learnable token sequences. No new physical entities are postulated.

free parameters (2)
  • model architecture hyperparameters
    Number of layers, attention heads, embedding dimension, and training hyperparameters for the lightweight encoder are chosen but not enumerated in the abstract.
  • VQ-VAE codebook size and training parameters
    The vector-quantized variational autoencoder used for learned tokenization requires multiple hyperparameters whose values are not reported.
axioms (2)
  • domain assumption Background events can be represented as sequences whose statistical structure is learnable by a masked-token objective without explicit physics modeling.
    Invoked in the description of training solely on SM background to capture its structure.
  • domain assumption Anomalous BSM events will produce measurable increases in prediction error under the learned background model.
    Core premise for flagging deviations at inference time.

pith-pipeline@v0.9.0 · 5527 in / 1422 out tokens · 65527 ms · 2026-05-09T23:21:02.771141+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1] A. Radford, K. Narasimhan, T. Salimans and I. Sutskever, Improving Language Understanding by Generative Pre-Training, 2018.
  2. [2] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain et al., Large Language Models: A Survey, arXiv:2402.06196.
  3. [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez et al., Attention Is All You Need, CoRR abs/1706.03762 (2023) [1706.03762].
  4. [4] E.M. Bender, T. Gebru, A. McMillan-Major and S. Shmitchell, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Association for Computing Machinery (ACM), March 2021.
  5. [5] The ATLAS and CMS Collaborations, Highlights of the HL-LHC physics projections by ATLAS and CMS, arXiv:2504.00672.
  6. [6] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805.
  7. [7] T. Golling, L. Heinrich, M. Kagan, S. Klein, M. Leigh, M. Osadchy et al., Masked particle modeling on sets: towards self-supervised high energy physics foundation models, Mach. Learn. Sci. Tech. 5 (2024) 035074 [2401.13537].
  8. [8] K.G. Barman et al., Large physics models: towards a collaborative approach with large language models and foundation models, Eur. Phys. J. C 85 (2025) 1066 [2501.05382].
  9. [9] L. Builtjes, S. Caron, P. Moskvitina, C. Nellist, R.R. de Austri, R. Verheyen et al., Attention to the strengths of physical interactions: Transformer and graph-based event classification for particle physics experiments, SciPost Phys. 19 (2025) 028.
  10. [10] T. Aarrestad, M. van Beekveld, M. Bona, A. Boveia, S. Caron, J. Davies et al., The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider, SciPost Phys. 12 (2022).
  11. [11] S. Caron, R.R. de Austri and Z. Zhang, Mixture-of-Theories training: can we find new physics and anomalies better by mixing physical theories?, JHEP 03 (2023) 004 [2207.07631].
  12. [12] S. Caron, L. Hendriks and R. Verheyen, Rare and Different: Anomaly Scores from a combination of likelihood and out-of-distribution models to detect new physics at the LHC, SciPost Phys. 12 (2022) 077 [2106.10164].
  13. [13] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen and Y. Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv:2104.09864.
  14. [14] X. Chu, Z. Tian, B. Zhang, X. Wang and C. Shen, Conditional Positional Encodings for Vision Transformers, arXiv:2102.10882.
  15. [15] D.P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, arXiv:1412.6980.
  16. [16] A. van den Oord, O. Vinyals and K. Kavukcuoglu, Neural Discrete Representation Learning, arXiv:1711.00937.
  17. [17] J. Birk, A. Hallin and G. Kasieczka, OmniJet-α: the first cross-task foundation model for particle physics, Mach. Learn. Sci. Tech. 5 (2024) 035031.
  18. [18] S. Shleifer, J. Weston and M. Ott, NormFormer: Improved Transformer Pretraining with Extra Normalization, arXiv:2110.09456.
  19. [19] J. Zhang, F. Zhan, C. Theobalt and S. Lu, Regularized Vector Quantization for Tokenized Image Synthesis, arXiv:2303.06424.
  20. [20] S. Caron, J. García Navarro, M. Moreno Llácer, P. Moskvitina, M. Rovers, A. Rubio Jiménez et al., Universal anomaly detection at the LHC: transforming optimal classifiers and the DDD method, Eur. Phys. J. C 85 (2025) 415 [2406.18469].