pith. machine review for the scientific record. sign in

arxiv: 2605.09881 · v1 · submitted 2026-05-11 · ✦ hep-ph · cs.LG· hep-ex

Recognition: 1 theorem link

· Lean Theorem

Dissecting Jet-Tagger Through Mechanistic Interpretability

Sanmay Ganguly, Saurabh Rai

Pith reviewed 2026-05-12 04:45 UTC · model grok-4.3

classification ✦ hep-ph cs.LGhep-ex
keywords jet taggingmechanistic interpretabilityparticle transformertop quark taggingenergy correlatorsattention headsresidual streamsubstructure
0
0 comments X

The pith

A six-head circuit in a particle transformer recovers most top-quark jet tagging performance through a source-relay-readout structure aligned with energy correlators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies mechanistic interpretability tools to a Particle Transformer trained for top-quark jet tagging. It locates a sparse six-head subnetwork that preserves the great majority of the full model's accuracy while exposing a clean internal flow: one early head sources the signal, middle heads relay by attending to pairwise substructure, and a late head reads out the aggregated decision. Linear probes on the residual stream show stronger alignment with energy correlator observables than with N-subjettiness, and a preference for two-prong over three-prong features. An apparent early classification commitment turns out to be a basis rotation after the discriminating signal has already saturated inside the particle attention stack. These findings matter because they indicate that gradient descent on jet data can rediscover physically meaningful structures without explicit supervision, allowing the same interpretability methods used on language models to be transferred to high-energy physics classifiers.

Core claim

Combining zero ablation, path patching with two complementary on-manifold corruption strategies, and linear probing of the residual stream, the authors identify a sparse six-head circuit that recovers the great majority of the full model performance while admitting a clean source-relay-readout interpretation. In this circuit a single early-layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to hard pairwise substructure, and a single late-layer head reads out the aggregated signal. Linear probes show that the residual stream is preferentially aligned with the energy correlator basis over the N-subjettiness basis, with stronger

What carries the argument

The sparse six-head circuit recovered by zero ablation and path patching, which implements source-relay-readout flow and aligns the residual stream with energy correlator observables.

If this is right

  • The bulk of the classification decision can be explained by a small, human-readable subgraph rather than the full network.
  • The internal representations encode two-prong jet substructure observables drawn from the energy correlator family.
  • The model performs an internal basis rotation early in the network rather than committing to a classification in a single step.
  • Gradient descent applied to jet data can rediscover physically meaningful structures without explicit supervision.
  • Mechanistic interpretability methods developed for language models transfer directly to jet-physics classifiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The discovered circuit could be used as a starting point for constructing lighter, more verifiable jet taggers that implement the same logic explicitly.
  • The preference for energy correlators suggests the model exploits QCD symmetries that might be missed by architectures trained only on N-subjettiness variables.
  • Repeating the analysis on other tagging datasets or architectures would test whether different training regimes produce different circuits or the same energy-correlator alignment.

Load-bearing premise

The chosen on-manifold corruption strategies and zero-ablation interventions isolate the true causal circuit without introducing new artifacts that the model was never trained to handle.

What would settle it

Ablating every head outside the six identified ones and checking whether the performance drop is as small as claimed, or running linear probes and verifying that alignment with energy correlator features exceeds alignment with N-subjettiness features.

Figures

Figures reproduced from arXiv: 2605.09881 by Sanmay Ganguly, Saurabh Rai.

Figure 1
Figure 1. Figure 1: Schematic overview of the analysis pipeline. A boosted top jet ( [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Logit-lens AUC as a function of representation depth. At each depth, the mean [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Mean logit difference ⟨log ptop − log pQCD⟩ as a function of representation depth, evaluated separately for top jets (red) and QCD jets (blue) over 10,000 test jets. Through all particle attention layers the two classes have similar and near-zero logit differences. A sharp class separation emerges only at the first class attention block (Cls0). Read at face value, these observations suggest that no class-d… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of logit-lens AUC (blue circles) and per-layer trained logistic probe [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Head importance measured by zero ablation, averaged over five independent training [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Direct effect (path-patching recovery score) for all 16 heads. Heads not in the [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The identified circuit. Nodes represent the six circuit heads; node color encodes [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Partial minimality test. Test AUC of the partial circuit as heads are added one at [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Random-baseline comparison. Test AUC distribution for 200 randomly sampled [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Linear probe R2 for jet-level observables as a function of representation depth, including both particle attention layers (L0–L4) and class attention blocks (Cls0, Cls1). Jet mass and the leading-pT fraction are encoded with high R2 from the embedding layer; τ32 and τ21 improve more gradually through the particle attention stack. The binary is top probe shows only modest improvement between L4 and the cla… view at source ↗
Figure 11
Figure 11. Figure 11: Linear probe R2 for energy correlator observables as a function of representation depth. Left: the 3-prong observables C (β=1) 3 (red) and N (β=1) 3 (orange) against the N￾subjettiness reference τ32 (grey, dashed). Right: the same 3-prong observables compared to the 2-prong observables C (β=1) 2 (blue, dashed) and D (β=1) 2 (green, dashed). The shaded region marks the class attention blocks. The 3-prong o… view at source ↗
Figure 12
Figure 12. Figure 12: Linear probe R2 for D (β=1) 2 (blue) and τ32 (red) in raw form (left) and after residualization against jet mass (right). The dashed grey line on the left panel shows the jet mass probe R2 for reference. The D2-over-τ32 encoding advantage is preserved, and indeed widens slightly, after the contribution from jet mass has been removed. 8.2 Causal feature ablation The Pearson correlations of Section 8.1 are … view at source ↗
Figure 13
Figure 13. Figure 13: Causal feature ablation. For each of the four pairwise features (coloured dashed [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Class-discriminating powers δ2 (blue) and δ3 (red) for the six circuit heads, computed from the Pearson correlations between the head’s attention weights and the pairwise proxies of Eqs. (3)–(4) on top and QCD jets over 2,000 test jets. For L0H2, L1H0, L1H3, and L3H3, the 3-prong discriminating power exceeds the 2-prong discriminating power; for L0H1 and L1H1, both are negligible. 9 Discussion In this sec… view at source ↗
read the original abstract

Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained on the Top Quark Tagging reference dataset, with the goal of identifying the computational circuit responsible for jet classification and characterizing the physical content of its internal representations. Combining zero ablation, path patching with two complementary on-manifold corruption strategies and linear probing of the residual stream, we identify a sparse six-head circuit that recovers the great majority of the full model performance while admitting a clean source-relay-readout interpretation. In this circuit, a single early layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to hard pairwise substructure and a single late-layer head reads out the aggregated signal. Linear probes show that the residual stream is preferentially aligned with the energy correlator basis over the $N$-subjettiness basis. Within the energy correlator basis, the model preferentially encodes 2-prong substructure observables over the 3-prong observables. A per-layer trained probe further reveals that the apparent single step commitment of the model to a classification decision in the first class attention block is in fact a basis rotation, with the discriminating signal already saturating in the particle attention stack. These results demonstrate that mechanistic interpretability methods developed for natural language models can be used for jet physics classifiers and indicate that gradient descent may rediscover physically meaningful aspects of jet tagging without supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript applies mechanistic interpretability methods (zero ablation, path patching with two on-manifold corruption strategies, and linear probing of the residual stream) to a Particle Transformer trained on the Top Quark Tagging dataset. It identifies a sparse six-head circuit—an early source head, middle-layer relay heads attending to pairwise substructure, and a late readout head—that recovers the great majority of the full model's jet classification performance. Linear probes indicate that the residual stream preferentially encodes energy-correlator observables (especially 2-prong substructure) over N-subjettiness, and that an apparent early commitment to classification is actually a basis rotation already saturated in the particle attention stack.

Significance. If the circuit identification and causal interventions hold, the work demonstrates that standard MI techniques transfer successfully to high-energy physics classifiers and that gradient descent can rediscover physically meaningful jet substructure features without supervision. The quantitative performance recovery, clean source-relay-readout decomposition, and explicit comparison of probe bases constitute concrete strengths; the absence of fitted parameters or invented entities in the analysis further supports its empirical grounding.

major comments (2)
  1. [§4.2] §4.2 (path patching): the two on-manifold corruption strategies are described as complementary, but the manuscript does not state whether they were pre-specified before any ablation runs or selected after observing initial results; post-hoc choice would weaken the claim that the recovered six-head circuit is the minimal faithful one.
  2. [§5.1] §5.1 and Table 3: the linear-probe results claim preferential alignment with the energy-correlator basis, yet the probe training details (regularization, number of dimensions retained, and random baseline accuracy) are not reported, so it is impossible to judge whether the reported preference exceeds what would be expected from the residual-stream dimensionality alone.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'great majority of the full model performance' should be replaced by the exact recovered fraction (e.g., 87 % top-tagging accuracy) for precision.
  2. [Figure 4] Figure 4 caption: the color scale for attention maps is not labeled with numerical values, making it difficult to compare attention strengths across heads.
  3. [§3.1] §3.1: the Particle Transformer architecture diagram omits the exact number of heads per layer and the residual-stream dimension, which are needed to interpret the six-head circuit size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and positive recommendation for minor revision. We address the two major comments below and will incorporate clarifications into the revised manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (path patching): the two on-manifold corruption strategies are described as complementary, but the manuscript does not state whether they were pre-specified before any ablation runs or selected after observing initial results; post-hoc choice would weaken the claim that the recovered six-head circuit is the minimal faithful one.

    Authors: The two on-manifold corruption strategies were pre-specified prior to performing the ablation runs. They were chosen to be complementary in probing the circuit's dependence on different aspects of the input distribution. We will revise §4.2 to state this explicitly and thereby strengthen the claim that the six-head circuit is the minimal faithful one. revision: yes

  2. Referee: [§5.1] §5.1 and Table 3: the linear-probe results claim preferential alignment with the energy-correlator basis, yet the probe training details (regularization, number of dimensions retained, and random baseline accuracy) are not reported, so it is impossible to judge whether the reported preference exceeds what would be expected from the residual-stream dimensionality alone.

    Authors: We agree that these details are essential for evaluating the probe results. We will add to §5.1 and Table 3 the regularization method and strength used for training the linear probes, the number of dimensions retained from the residual stream, and the random baseline accuracy. This will allow readers to confirm that the observed preference for the energy-correlator basis exceeds what would be expected from dimensionality considerations alone. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical circuit identification via interventions

full rationale

The paper's central results rest on zero ablation, path patching, and linear probing applied to a pre-trained Particle Transformer on the external Top Quark Tagging dataset. These are interventional measurements that quantify causal contributions; the recovered six-head circuit and its source-relay-readout interpretation are outputs of those measurements rather than inputs redefined as predictions. No equations derive a quantity from itself, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or imported uniqueness theorems. The analysis is self-contained against the held-out test set and standard interpretability benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the standard assumptions of mechanistic interpretability that interventions (zero ablation and path patching) reveal causal structure and that linear probes faithfully reflect the information present in the residual stream. No new physical entities or free parameters are introduced beyond the usual training hyperparameters of the Particle Transformer.

axioms (2)
  • domain assumption Interventional methods such as zero ablation and path patching isolate the causal contributions of individual attention heads without creating out-of-distribution artifacts that the model exploits.
    Invoked when claiming that the six-head circuit recovers the great majority of performance.
  • domain assumption Linear probes trained on the residual stream accurately measure the presence of physical observables such as energy correlators.
    Used to conclude that the model preferentially encodes 2-prong substructure.

pith-pipeline@v0.9.0 · 5568 in / 1466 out tokens · 30570 ms · 2026-05-12T04:45:09.836962+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors

  1. [1]

    G. P. Salam,Towards Jetography, Eur. Phys. J. C67, 637 (2010), doi:10.1140/epjc/s10052- 010-1314-6,0906.1833

  2. [2]

    Kogleret al.,Jet Substructure at the Large Hadron Collider: Experimental Review, Rev

    R. Kogleret al.,Jet Substructure at the Large Hadron Collider: Experimental Review, Rev. Mod. Phys.91(4), 045003 (2019), doi:10.1103/RevModPhys.91.045003, 1803.06991

  3. [3]

    A. J. Larkoski, I. Moult and B. Nachman,Jet substructure at the large hadron collider: A review of recent advances in theory and machine learning, Physics Reports841, 1 (2020), doi:https://doi.org/10.1016/j.physrep.2019.11.001, Jet substructure at the Large Hadron Collider: A review of recent advances in theory and machine learning

  4. [4]

    de Oliveira, M

    L. de Oliveira, M. Kagan, L. Mackey, B. Nachman and A. Schwartzman,Jet-images — deep learning edition, JHEP07, 069 (2016), doi:10.1007/JHEP07(2016)069, 1511.05190

  5. [5]

    Qu and L

    H. Qu and L. Gouskos,ParticleNet: Jet Tagging via Particle Clouds, Phys. Rev. D 101(5), 056019 (2020), doi:10.1103/PhysRevD.101.056019,1902.08570

  6. [6]

    Thais, P

    S. Thais, P. Calafiura, G. Chachamis, G. DeZoort, J. Duarte, S. Ganguly, M. Kagan, D. Murnane, M. S. Neubauer and K. Terao,Graph Neural Networks in Particle Physics: Implementations, Innovations, and Challenges, InSnowmass 2021(2022),2203.12852

  7. [7]

    Shlomi, P

    J. Shlomi, P. Battaglia and J.-R. Vlimant,Graph Neural Networks in Particle Physics (2020), doi:10.1088/2632-2153/abbf9a,2007.13681

  8. [8]

    H. Qu, C. Li and S. Qian,Particle Transformer for Jet Tagging(2022),2202.03772

  9. [9]

    Thaler and K

    J. Thaler and K. Van Tilburg,Identifying Boosted Objects with N-subjettiness, JHEP03, 015 (2011), doi:10.1007/JHEP03(2011)015,1011.2268

  10. [10]

    Thaler and K

    J. Thaler and K. Van Tilburg,Maximizing Boosted Top Identification by Minimizing N-subjettiness, JHEP02, 093 (2012), doi:10.1007/JHEP02(2012)093,1108.2701

  11. [11]

    A. J. Larkoski, G. P. Salam and J. Thaler,Energy Correlation Functions for Jet Substructure, JHEP06, 108 (2013), doi:10.1007/JHEP06(2013)108,1305.0007

  12. [12]

    Moult, L

    I. Moult, L. Necib and J. Thaler,New Angles on Energy Correlation Functions, JHEP 12, 153 (2016), doi:10.1007/JHEP12(2016)153,1609.07483

  13. [13]

    Butteret al.,The Machine Learning landscape of top taggers, SciPost Phys.7, 014 (2019), doi:10.21468/SciPostPhys.7.1.014,1902.09914

    A. Butteret al.,The Machine Learning landscape of top taggers, SciPost Phys.7, 014 (2019), doi:10.21468/SciPostPhys.7.1.014,1902.09914

  14. [14]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin,Attention Is All You Need, arXiv e-prints arXiv:1706.03762 (2017), doi:10.48550/arXiv.1706.03762,1706.03762

  15. [15]

    Elhage, N

    N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drainet al.,A mathematical framework for transformer circuits, Transformer Circuits Thread (2021). 37 SciPost Physics Submission

  16. [16]

    K. Wang, A. Variengien, A. Conmy, B. Shlegeris and J. Steinhardt,Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, arXiv e-prints arXiv:2211.00593 (2022), doi:10.48550/arXiv.2211.00593,2211.00593

  17. [17]

    Conmy, A

    A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim and A. Garriga-Alonso, Towards Automated Circuit Discovery for Mechanistic Interpretability, arXiv e-prints arXiv:2304.14997 (2023), doi:10.48550/arXiv.2304.14997,2304.14997

  18. [18]

    Kasieczka, T

    G. Kasieczka, T. Plehn, J. Thompson and M. Russel,Top quark tagging reference dataset, doi:10.5281/zenodo.2603256 (2019)

  19. [19]

    Agarwal, L

    G. Agarwal, L. Hay, I. Iashvili, B. Mannix, C. McLean, M. Morris, S. Rappoccio and U. Schubert,Explainable AI for ML jet taggers using expert variables and layerwise relevance propagation, JHEP05, 208 (2021), doi:10.1007/JHEP05(2021)208, 2011.13466

  20. [21]

    Bhattacherjee, C

    B. Bhattacherjee, C. Bose, A. Chakraborty and R. Sengupta,Boosted top tagging and its interpretation using Shapley values, Eur. Phys. J. Plus139(12), 1131 (2024), doi:10.1140/epjp/s13360-024-05910-9,2212.11606

  21. [22]

    P. D. Patel and S. Ganguly,Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer, GNNShap, and GradCAM for Jet Tagging in the Lund Jet Plane(2026), 2604.25885

  22. [23]

    Mokhtar, R

    F. Mokhtar, R. Kansal, D. Diaz, J. Duarte, J. Pata, M. Pierini and J.-R. Vlimant, Explaining machine-learned particle-flow reconstruction, In35th Conference on Neural Information Processing Systems(2021),2111.12840

  23. [24]

    Maglianella, L

    L. Maglianella, L. Nicoletti, S. Giagu, C. Napoli and S. Scardapane,Convergent approaches to AI explainability for HEP muonic particles pattern recognition, Computing and Software for Big Science7(1), 8 (2023), doi:10.1007/s41781-023-00102-z

  24. [25]

    A. Khot, X. Wang, A. Roy, V. Kindratenko and M. S. Neubauer,Evidential deep learning for uncertainty quantification and out-of-distribution detection in jet identification using deep neural networks, Mach. Learn. Sci. Tech.6(3), 035003 (2025), doi:10.1088/2632- 2153/ade51b,2501.05656

  25. [26]

    Mokhtar, R

    F. Mokhtar, R. Kansal and J. Duarte,Do graph neural networks learn traditional jet substructure?, In36th Conference on Neural Information Processing Systems: Workshop on Machine Learning and the Physical Sciences(2022),2211.09912

  26. [27]

    Andreassen, I

    A. Andreassen, I. Feige, C. Frye and M. D. Schwartz,Binary JUNIPR: an inter- pretable probabilistic model for discrimination, Phys. Rev. Lett.123(18), 182001 (2019), doi:10.1103/PhysRevLett.123.182001,1906.10137

  27. [28]

    Bendavid, D

    J. Bendavid, D. Conde, M. Morales-Alvarado, V. Sanz and M. Ubiali,Angular coefficients from interpretable machine learning with symbolic regression, JHEP02, 081 (2026), doi:10.1007/JHEP02(2026)081,2508.00989. 38 SciPost Physics Submission

  28. [29]

    Konar, V

    P. Konar, V. S. Ngairangbam, M. Spannowsky and D. Srivastava,Stable and inter- pretable jet physics with IRC-safe equivariant feature extraction, JHEP03, 219 (2026), doi:10.1007/JHEP03(2026)219,2509.22059

  29. [30]

    Bradshaw, S

    L. Bradshaw, S. Chang and B. Ostdiek,Creating simple, interpretable anomaly de- tectors for new physics in jet substructure, Phys. Rev. D106(3), 035014 (2022), doi:10.1103/PhysRevD.106.035014,2203.01343

  30. [31]

    Genovese, A

    D. Genovese, A. Sgroi, A. Devoto, S. Valentine, L. Wood, C. Sebastiani, S. Scardapane, M. D’Onofrio and S. Giagu,Mixture-of-experts graph transformers for interpretable particle collision detection, Sci. Rep.15(1), 27906 (2025), doi:10.1038/s41598-025-12003-9, 2501.03432

  31. [32]

    S. Vent, R. Winterhalder and T. Plehn,The Physics Behind ML-based Quark-Gluon Taggers, SciPost Phys.20, 084 (2026), doi:10.21468/SciPostPhys.20.3.084,2507.21214

  32. [33]

    IAFormer: Interaction-Aware Transformer network for collider data analysis

    W. Esmail, A. Hammad and M. Nojiri,IAFormer: Interaction-Aware Trans- former network for collider data analysis, SciPost Phys.20(4), 108 (2026), doi:10.21468/SciPostPhys.20.4.108,2505.03258

  33. [34]

    A. Wang, A. Gandrakota, J. Ngadiuba, V. Sahu, P. Bhatnagar, E. E. Khoda and J. Duarte, Interpreting Transformers for Jet Tagging(2024),2412.03673

  34. [35]

    Legge, A

    T. Legge, A. Wang, J. Ortiz, V. Limouzi, Z. Zhao, A. Gandrakota, E. E. Khoda, J. Nga- diuba, J. Duarte and R. Cavanaugh,Why Is Attention Sparse In Particle Transformer?, In39th Annual Conference on Neural Information Processing Systems: Includes Machine Learning and the Physical Sciences (ML4PS)(2025),2512.00210

  35. [36]

    What exactly did the Transformer learn from our physics data?

    M. Erdmann, N. Langner, J. Schulte and D. Wirtz,What Exactly Did the Trans- former Learn from Our Physics Data?, Comput. Softw. Big Sci.9(1), 16 (2025), doi:10.1007/s41781-025-00145-4,2505.21042

  36. [37]

    nostalgebraist,interpreting GPT: the logit lens, LessWrong (2020)

  37. [38]

    Understanding intermediate layers using linear classifier probes

    G. Alain and Y. Bengio,Understanding intermediate layers using linear classifier probes, arXiv e-prints arXiv:1610.01644 (2016), doi:10.48550/arXiv.1610.01644,1610.01644

  38. [39]

    W. N. van Wieringen,Lecture notes on ridge regression, arXiv e-prints arXiv:1509.09169 (2015), doi:10.48550/arXiv.1509.09169,1509.09169

  39. [40]

    A comprehensive guide to the physics and usage of PYTHIA 8.3

    C. Bierlichet al.,A comprehensive guide to the physics and usage of PYTHIA 8.3, SciPost Phys. Codeb.2022, 8 (2022), doi:10.21468/SciPostPhysCodeb.8,2203.11601

  40. [41]

    Cacciari, G.P

    M. Cacciari, G. P. Salam and G. Soyez,The anti- kt jet clustering algorithm, JHEP04, 063 (2008), doi:10.1088/1126-6708/2008/04/063,0802.1189

  41. [42]

    Cacciari, G.P

    M. Cacciari, G. P. Salam and G. Soyez,FastJet User Manual, Eur. Phys. J. C72, 1896 (2012), doi:10.1140/epjc/s10052-012-1896-2,1111.6097

  42. [43]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter,Decoupled Weight Decay Regularization, arXiv e-prints arXiv:1711.05101 (2017), doi:10.48550/arXiv.1711.05101,1711.05101. 39 SciPost Physics Submission

  43. [44]

    Normformer: Improved transformer pretraining with extra normalization

    S. Shleifer, J. Weston and M. Ott,NormFormer: Improved Transformer Pretraining with Extra Normalization, arXiv e-prints arXiv:2110.09456 (2021), doi:10.48550/arXiv.2110.09456,2110.09456

  44. [45]

    A. J. Larkoski, I. Moult and D. Neill,Power Counting to Better Jet Observables, JHEP 12, 009 (2014), doi:10.1007/JHEP12(2014)009,1409.6298

  45. [46]

    A. J. Larkoski, I. Moult and D. Neill,Analytic Boosted Boson Discrimination, JHEP05, 117 (2016), doi:10.1007/JHEP05(2016)117,1507.03018

  46. [47]

    C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picuset al.,Array programming with NumPy, Nature585(7825), 357 (2020), doi:10.1038/s41586-020-2649-2

  47. [48]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit and N. Houlsby,An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv e-prints arXiv:2010.11929 (2020), doi:10.48550/arXiv.2010.11929,2010.11929. 40