Dissecting Jet-Tagger Through Mechanistic Interpretability
Pith review · 2026-05-12 04:45 UTC · model grok-4.3 · 1 Lean theorem link
The pith
A six-head circuit in a particle transformer recovers most of the full model's top-quark tagging performance through a source-relay-readout structure aligned with energy correlators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining zero ablation, path patching with two complementary on-manifold corruption strategies, and linear probing of the residual stream, the authors identify a sparse six-head circuit that recovers the great majority of the full model's performance while admitting a clean source-relay-readout interpretation. In this circuit a single early-layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to hard pairwise substructure, and a single late-layer head reads out the aggregated signal. Linear probes show that the residual stream is preferentially aligned with the energy correlator basis over the N-subjettiness basis, with stronger encoding of 2-prong substructure observables than of 3-prong observables within that basis.
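For readers unfamiliar with the intervention, a minimal sketch of what zero ablation of a single attention head can look like in PyTorch, assuming the hooked module returns the concatenated per-head outputs before the output projection; `model`, the module path `model.blocks[...].attn`, and the head indices are hypothetical placeholders, not taken from the paper's code.

```python
def zero_ablate_head(attn_module, head_idx, num_heads):
    """Register a forward hook that zeroes one head's output slice.

    Assumes attn_module's output has shape (batch, tokens, num_heads * head_dim),
    i.e. the concatenated per-head outputs before the output projection.
    Returns the hook handle so the ablation can be undone.
    """
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        head_dim = out.shape[-1] // num_heads
        out = out.clone()
        out[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        return (out, *output[1:]) if isinstance(output, tuple) else out
    return attn_module.register_forward_hook(hook)

# Hypothetical usage: silence one head, re-evaluate, then restore.
# handle = zero_ablate_head(model.blocks[3].attn, head_idx=5, num_heads=8)
# auc_ablated = evaluate_auc(model, test_loader)
# handle.remove()
```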
What carries the argument
The sparse six-head circuit recovered by zero ablation and path patching, which implements a source-relay-readout flow and whose residual-stream representations align with energy correlator observables.
If this is right
- The bulk of the classification decision can be explained by a small, human-readable subgraph rather than the full network.
- The internal representations encode two-prong jet substructure observables drawn from the energy correlator family (the standard definitions of the competing observables are sketched after this list).
- The model performs an internal basis rotation early in the network rather than committing to a classification in a single step.
- Gradient descent applied to jet data can rediscover physically meaningful structures without explicit supervision.
- Mechanistic interpretability methods developed for language models transfer directly to jet-physics classifiers.
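For context, the two observable bases the probes compare are standard in the jet-substructure literature. A minimal reference, using the conventional definitions of the two-point energy correlation function (sensitive to 2-prong substructure) and N-subjettiness:

```latex
% Two-point energy correlation function, with momentum fractions z_i
% and pairwise angular distances theta_ij over jet constituents J:
\begin{align}
  e_2^{(\beta)} &= \sum_{i<j \in J} z_i\, z_j\, \theta_{ij}^{\beta},
  \qquad z_i = \frac{p_{T,i}}{\sum_{k \in J} p_{T,k}}, \\
% N-subjettiness: pT-weighted distance of constituents to the nearest
% of N candidate subjet axes, normalized by the jet radius R_0:
  \tau_N &= \frac{1}{d_0} \sum_{k \in J} p_{T,k}\,
  \min\bigl\{\Delta R_{1,k}, \ldots, \Delta R_{N,k}\bigr\},
  \qquad d_0 = \sum_{k \in J} p_{T,k}\, R_0.
\end{align}
```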
Where Pith is reading between the lines
- The discovered circuit could be used as a starting point for constructing lighter, more verifiable jet taggers that implement the same logic explicitly.
- The preference for energy correlators suggests the model exploits QCD symmetries that might be missed by architectures trained only on N-subjettiness variables.
- Repeating the analysis on other tagging datasets or architectures would test whether different training regimes produce different circuits or the same energy-correlator alignment.
Load-bearing premise
The chosen on-manifold corruption strategies and zero-ablation interventions isolate the true causal circuit without introducing new artifacts that the model was never trained to handle.
What would settle it
Ablating every head outside the six identified ones and checking whether the performance drop is as small as claimed, or running linear probes and verifying that alignment with energy correlator features exceeds alignment with N-subjettiness features.
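A minimal sketch of the first check, reusing the `zero_ablate_head` hook sketched above; `iter_attention_heads`, `evaluate_auc`, the module path, and the circuit coordinates are all hypothetical placeholders, not the paper's actual head indices.

```python
# Hypothetical completeness check: zero every head outside the circuit
# and compare performance with the full model. The (layer, head) pairs
# below are illustrative, not the paper's identified heads.
CIRCUIT = {(0, 2), (3, 1), (3, 5), (4, 0), (4, 6), (7, 3)}

def circuit_only_auc(model, test_loader, num_heads):
    handles = []
    for layer_idx, head_idx in iter_attention_heads(model):
        if (layer_idx, head_idx) not in CIRCUIT:
            attn = model.blocks[layer_idx].attn  # module path is an assumption
            handles.append(zero_ablate_head(attn, head_idx, num_heads))
    try:
        return evaluate_auc(model, test_loader)
    finally:
        for h in handles:  # always undo the ablations
            h.remove()

# The claim holds if circuit_only_auc(...) is close to the full-model AUC.
```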
Original abstract
Mechanistic interpretability seeks to reverse engineer a trained neural network by identifying the minimal subset of internal components. We perform a mechanistic interpretability analysis of the Particle Transformer architecture, trained on the Top Quark Tagging reference dataset, with the goal of identifying the computational circuit responsible for jet classification and characterizing the physical content of its internal representations. Combining zero ablation, path patching with two complementary on-manifold corruption strategies and linear probing of the residual stream, we identify a sparse six-head circuit that recovers the great majority of the full model performance while admitting a clean source-relay-readout interpretation. In this circuit, a single early layer head serves as the primary causal source, a cluster of middle-layer heads acts as relays selectively attending to hard pairwise substructure and a single late-layer head reads out the aggregated signal. Linear probes show that the residual stream is preferentially aligned with the energy correlator basis over the $N$-subjettiness basis. Within the energy correlator basis, the model preferentially encodes 2-prong substructure observables over the 3-prong observables. A per-layer trained probe further reveals that the apparent single step commitment of the model to a classification decision in the first class attention block is in fact a basis rotation, with the discriminating signal already saturating in the particle attention stack. These results demonstrate that mechanistic interpretability methods developed for natural language models can be used for jet physics classifiers and indicate that gradient descent may rediscover physically meaningful aspects of jet tagging without supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies mechanistic interpretability methods (zero ablation, path patching with two on-manifold corruption strategies, and linear probing of the residual stream) to a Particle Transformer trained on the Top Quark Tagging dataset. It identifies a sparse six-head circuit—an early source head, middle-layer relay heads attending to pairwise substructure, and a late readout head—that recovers the great majority of the full model's jet classification performance. Linear probes indicate that the residual stream preferentially encodes energy-correlator observables (especially 2-prong substructure) over N-subjettiness, and that an apparent early commitment to classification is actually a basis rotation already saturated in the particle attention stack.
Significance. If the circuit identification and causal interventions hold, the work demonstrates that standard MI techniques transfer successfully to high-energy physics classifiers and that gradient descent can rediscover physically meaningful jet substructure features without supervision. The quantitative performance recovery, clean source-relay-readout decomposition, and explicit comparison of probe bases constitute concrete strengths; the absence of fitted parameters or invented entities in the analysis further supports its empirical grounding.
Major comments (2)
- §4.2 (path patching): the two on-manifold corruption strategies are described as complementary, but the manuscript does not state whether they were pre-specified before any ablation runs or selected after observing initial results; a post-hoc choice would weaken the claim that the recovered six-head circuit is the minimal faithful one.
- §5.1 and Table 3: the linear-probe results claim preferential alignment with the energy-correlator basis, yet the probe training details (regularization, number of dimensions retained, and random baseline accuracy) are not reported, so it is impossible to judge whether the reported preference exceeds what would be expected from the residual-stream dimensionality alone.
Minor comments (3)
- Abstract: the phrase "great majority of the full model performance" should be replaced by the exact recovered fraction (e.g., 87% top-tagging accuracy) for precision.
- Figure 4 caption: the color scale for the attention maps is not labeled with numerical values, making it difficult to compare attention strengths across heads.
- §3.1: the Particle Transformer architecture diagram omits the exact number of heads per layer and the residual-stream dimension, which are needed to interpret the six-head circuit size.
Simulated Author's Rebuttal
We thank the referee for their thorough review and positive recommendation for minor revision. We address the two major comments below and will incorporate clarifications into the revised manuscript.
Point-by-point responses
- Referee: §4.2 (path patching): the two on-manifold corruption strategies are described as complementary, but the manuscript does not state whether they were pre-specified before any ablation runs or selected after observing initial results; a post-hoc choice would weaken the claim that the recovered six-head circuit is the minimal faithful one.
  Authors: The two on-manifold corruption strategies were pre-specified prior to performing the ablation runs. They were chosen to be complementary in probing the circuit's dependence on different aspects of the input distribution. We will revise §4.2 to state this explicitly and thereby strengthen the claim that the six-head circuit is the minimal faithful one. Revision: yes.
- Referee: §5.1 and Table 3: the linear-probe results claim preferential alignment with the energy-correlator basis, yet the probe training details (regularization, number of dimensions retained, and random baseline accuracy) are not reported, so it is impossible to judge whether the reported preference exceeds what would be expected from the residual-stream dimensionality alone.
  Authors: We agree that these details are essential for evaluating the probe results. We will add to §5.1 and Table 3 the regularization method and strength used for training the linear probes, the number of dimensions retained from the residual stream, and the random baseline accuracy. This will allow readers to confirm that the observed preference for the energy-correlator basis exceeds what would be expected from dimensionality considerations alone. Revision: yes.
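To make the requested detail concrete, a minimal sketch of a ridge-regularized linear probe with a shuffled-label baseline, assuming scikit-learn; the activations, targets, and regularization strength are illustrative placeholders, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data: resid stands in for (n_jets, d_model) residual-stream
# activations at one layer; targets for one observable, e.g. e_2^(beta).
rng = np.random.default_rng(0)
resid = rng.normal(size=(10_000, 128))
targets = rng.normal(size=10_000)

X_tr, X_te, y_tr, y_te = train_test_split(resid, targets, random_state=0)

probe = Ridge(alpha=10.0)  # regularization strength is illustrative
probe.fit(X_tr, y_tr)
r2 = probe.score(X_te, y_te)  # R^2 of the probe on held-out jets

# Dimensionality-only baseline: fit the same probe against shuffled
# targets; any apparent alignment here reflects d_model alone.
baseline = Ridge(alpha=10.0).fit(X_tr, rng.permutation(y_tr)).score(X_te, y_te)

print(f"probe R^2 = {r2:.3f}  vs shuffled-label baseline = {baseline:.3f}")
```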
Circularity Check
No circularity: purely empirical circuit identification via interventions
Full rationale
The paper's central results rest on zero ablation, path patching, and linear probing applied to a pre-trained Particle Transformer on the external Top Quark Tagging dataset. These are interventional measurements that quantify causal contributions; the recovered six-head circuit and its source-relay-readout interpretation are outputs of those measurements rather than inputs redefined as predictions. No equations derive a quantity from itself, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or imported uniqueness theorems. The analysis is self-contained against the held-out test set and standard interpretability benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: interventional methods such as zero ablation and path patching isolate the causal contributions of individual attention heads without creating out-of-distribution artifacts that the model exploits.
- Domain assumption: linear probes trained on the residual stream accurately measure the presence of physical observables such as energy correlators.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance: unclear. Matched text: "Combining zero ablation, path patching with two complementary on-manifold corruption strategies and linear probing of the residual stream, we identify a sparse six-head circuit..."