Recognition: no theorem link
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
Pith reviewed 2026-05-13 04:35 UTC · model grok-4.3
The pith
Large-patch x-prediction in voxel-space diffusion transformers generates high-granularity calorimeter showers efficiently without a latent tokenizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CaloArt is a DiT-style backbone augmented with three-dimensional positional encodings and trained by conditional flow matching in which the prediction target and loss space are decoupled. On the smaller Dataset 2, where fine patches remain affordable, the model records the strongest Fréchet particle distance, high-level observables, and classifier accuracy among reported methods; on the 40 500-voxel Dataset 3 the same architecture with large patches and x-prediction improves every metric relative to v-prediction and sits on the quality-versus-generation-time frontier. Both final models complete a shower in roughly 10 ms on a single GPU (9.71 ms for Dataset 2, 11.14 ms for Dataset 3).
What carries the argument
Large-patch tokenization of the raw voxel grid inside a diffusion transformer that uses x-prediction under conditional flow matching.
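To make the machinery concrete, here is a minimal, hedged PyTorch sketch of large-patch tokenization of a 3D voxel grid followed by one x-prediction conditional flow-matching training step. The grid layout, the placeholder patch size, the toy backbone, and the velocity-space loss are illustrative assumptions, not the CaloArt implementation.

```python
# Hedged sketch: large-patch 3D tokenization + x-prediction flow matching.
# Shapes, the patch size, and the backbone are illustrative placeholders.
import torch
import torch.nn as nn

def patchify_3d(voxels, patch=(3, 10, 10)):
    """Split a (B, Z, A, R) voxel grid into flattened large patches (tokens)."""
    B, Z, A, R = voxels.shape
    pz, pa, pr = patch
    assert Z % pz == 0 and A % pa == 0 and R % pr == 0
    x = voxels.reshape(B, Z // pz, pz, A // pa, pa, R // pr, pr)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)        # (B, nz, na, nr, pz, pa, pr)
    return x.reshape(B, -1, pz * pa * pr)     # (B, tokens, patch_dim)

class ToyBackbone(nn.Module):
    """Stand-in for a DiT-style transformer mapping noisy tokens to an
    estimate of the clean tokens (x-prediction)."""
    def __init__(self, patch_dim, width=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim + 1, width), nn.GELU(), nn.Linear(width, patch_dim)
        )

    def forward(self, tokens, t):
        t_feat = t[:, None, None].expand(-1, tokens.shape[1], 1)
        return self.net(torch.cat([tokens, t_feat], dim=-1))

def x_prediction_fm_step(model, shower, opt):
    """One conditional flow-matching step: the model predicts clean data,
    the regression loss is taken in velocity space (decoupled loss space)."""
    x1 = patchify_3d(shower)                  # clean shower tokens
    x0 = torch.randn_like(x1)                 # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    tb = t[:, None, None]
    xt = (1 - tb) * x0 + tb * x1              # linear interpolant
    x1_hat = model(xt, t)                     # x-prediction
    # Convert the x-prediction to a velocity estimate and regress it on the
    # true interpolant velocity (x1 - x0).
    v_hat = (x1_hat - xt) / (1 - tb).clamp(min=1e-3)
    loss = ((v_hat - (x1 - x0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The decoupling named in the core claim is visible in the last few lines: the network's output lives in data (x) space, while the regression target is formed in velocity space after an algebraic conversion.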
If this is right
- On grids small enough for fine patches, x-prediction remains competitive with v-prediction while simplifying the training objective.
- On grids large enough that only coarse patches are feasible, x-prediction demonstrably raises all evaluated fidelity scores.
- Single-GPU inference stays below twelve milliseconds per shower for both tested resolutions.
- The method removes the requirement to train and maintain a separate latent-space tokenizer.
Where Pith is reading between the lines
- The same large-patch x-prediction recipe could be tested on other high-dimensional physics simulation tasks such as tracking or detector response in different geometries.
- If the reported metrics continue to correlate with downstream physics analyses, the approach would directly reduce the wall-clock time needed for Monte Carlo campaigns in current and future collider experiments.
- Extending the architecture to conditional generation on additional variables such as incident particle energy or angle would be a natural next test of its flexibility.
Load-bearing premise
The chosen metrics of Fréchet particle distance, high-level observables, and ResNet classifier accuracy are sufficient to certify that the generated showers reproduce the full physics content of real data across all relevant phase-space regions.
What would settle it
A direct comparison against full Geant4 simulation showing that model-generated showers deviate in shower-shape moments, energy-flow correlations, or rare topologies that the reported FPD and classifier scores cannot detect.
Original abstract
High-granularity calorimeters make ML-based fast shower simulation a high-dimensional generative modeling problem, where voxel-space generators must balance physics fidelity with training and inference cost. This work studies large-patch tokenization with x-prediction, enabling efficient raw voxel generation. We propose CaloArt, a modernized DiT-style backbone with 3D positional encoding and architectural refinements, trained via conditional flow matching with decoupled prediction and loss spaces. On CaloChallenge Dataset 2, where small patch size remains affordable, v-prediction performs well, and CaloArt achieves the best FPD, strongest high-level metrics, and strongest ResNet classifier metrics. On CaloChallenge Dataset 3, the 40500-voxel grid makes large patches necessary; x-prediction improves all reported metrics over v-prediction and places CaloArt on the quality-generation-time Pareto frontier. The final CCD2 and CCD3 models both retain O(10) ms single-GPU generation time, with 9.71 and 11.14 ms per shower. These results support large-patch voxel-space diffusion transformers with x-prediction as a compute-efficient route to high-granularity calorimeter shower synthesis, reducing training and inference cost without a pretrained latent tokenizer.
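For readers unfamiliar with the two targets named in the abstract, here is a minimal sketch of x- versus v-prediction under a linear conditional flow-matching interpolant. The convention (noise at t = 0, data at t = 1) is an assumption of this sketch; the paper's exact parameterization may differ.

```latex
% Assumed linear interpolant between noise x_0 ~ N(0, I) at t = 0 and data x_1 at t = 1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad u_t \equiv \dot{x}_t = x_1 - x_0 .

% v-prediction: the network v_\theta regresses the interpolant velocity directly
\mathcal{L}_v = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2 .

% x-prediction: the network outputs an estimate \hat{x}_1 of the clean shower;
% a velocity for sampling (or for a loss in a decoupled space) follows algebraically
\hat{u}(x_t, t) = \frac{\hat{x}_1(x_t, t) - x_t}{1 - t}, \qquad
\mathcal{L}_x = \mathbb{E}_{t,\,x_0,\,x_1}\,\big\| \hat{x}_1(x_t, t) - x_1 \big\|^2 .
```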
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CaloArt, a DiT-style diffusion transformer using large-patch tokenization and x-prediction for raw voxel-space generation of high-granularity calorimeter showers. Trained with conditional flow matching, it reports best-in-class FPD, high-level observables, and ResNet classifier accuracy on CaloChallenge Datasets 2 and 3, with single-GPU generation times of 9.71 ms and 11.14 ms per shower. The central claim is that this large-patch x-prediction approach offers a compute-efficient route to high-fidelity synthesis without a pretrained latent tokenizer.
Significance. If the results hold under scrutiny, the work could meaningfully advance fast shower simulation for high-energy physics, where high-granularity calorimeters demand scalable generative models to replace costly GEANT4 runs. Credit is due for explicit reporting of inference times, use of public benchmark datasets, and direct voxel-space operation that avoids latent tokenizer overhead. The architectural refinements (3D positional encoding, decoupled prediction/loss spaces) are clearly motivated. Significance is limited by the empirical nature of the claims and the need for stronger validation that the chosen proxies capture physics-relevant features across the full shower phase space.
major comments (3)
- [Abstract] Abstract and results on Dataset 3: the claim that x-prediction improves all reported metrics over v-prediction lacks an ablation isolating its effect from other changes (3D positional encoding, architectural refinements). This attribution is load-bearing for the central claim that x-prediction is the key enabler of the compute-efficient route.
- [Results] Results section (Dataset 3, 40500-voxel grid): no quantitative error bars, uncertainty estimates, or statistical significance tests accompany the FPD, high-level observables, or ResNet accuracy figures, despite these being the basis for declaring best-in-class performance and Pareto-frontier status.
- [Evaluation] Evaluation and discussion: the sufficiency of FPD, high-level observables, and ResNet classifier accuracy as proxies for physics fidelity is not demonstrated, particularly for localized energy depositions or long-range correlations that the large-patch regime may miss on Dataset 3, where small-patch or latent baselines cannot be compared directly.
minor comments (3)
- [Methods] Methods section: a brief equation or diagram contrasting x-prediction versus v-prediction (and their loss spaces) would aid readers unfamiliar with the diffusion-model variants.
- [Figures] Figures and tables: captions should explicitly state the patch sizes, number of tokens, and exact baseline configurations used for each dataset to support reproducibility.
- [Introduction] Introduction: ensure all cited prior calorimeter simulation works (including non-diffusion baselines) are referenced to contextualize the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our work. We address each of the major comments below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract and results on Dataset 3: the claim that x-prediction improves all reported metrics over v-prediction lacks an ablation isolating its effect from other changes (3D positional encoding, architectural refinements). This attribution is load-bearing for the central claim that x-prediction is the key enabler of the compute-efficient route.
Authors: The x- versus v-prediction comparison is performed using the identical CaloArt architecture on Dataset 3, with the only variation being the prediction target. This setup isolates the contribution of x-prediction from the other architectural elements. We will revise the abstract and results section to make this isolation explicit and add a note clarifying the controlled nature of the comparison. revision: partial
-
Referee: [Results] Results section (Dataset 3, 40500-voxel grid): no quantitative error bars, uncertainty estimates, or statistical significance tests accompany the FPD, high-level observables, or ResNet accuracy figures, despite these being the basis for declaring best-in-class performance and Pareto-frontier status.
Authors: We concur that uncertainty quantification is important for substantiating the performance claims. The revised version will incorporate error bars on all quantitative metrics, calculated from multiple runs with varied seeds. Statistical tests will be included to assess the significance of improvements over baselines. revision: yes
-
Referee: [Evaluation] Evaluation and discussion: the sufficiency of FPD, high-level observables, and ResNet classifier accuracy as proxies for physics fidelity is not demonstrated, particularly for localized energy depositions or long-range correlations that may be missed in the large-patch regime on Dataset 3 where small-patch or latent baselines cannot be directly compared.
Authors: FPD, high-level observables, and classifier-based metrics are the standard evaluation suite for the CaloChallenge and have been validated in the community for assessing shower fidelity. Our model outperforms or matches competitors on these for the high-granularity Dataset 3. We will enhance the discussion to note the limitations of these proxies regarding fine-scale features and long-range effects in the large-patch setting, while emphasizing that the chosen approach enables practical generation speeds without latent spaces. revision: partial
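As a concrete companion to the exchange above, here is a hedged NumPy/SciPy sketch of a Fréchet distance between Gaussian fits to reference and generated high-level feature sets, with a bootstrap spread of the kind the referee requests. The feature definitions and normalization of the actual CaloChallenge FPD tooling are not reproduced here; this is an illustration only.

```python
# Hedged sketch: Frechet distance between Gaussian fits to two (N, D) feature
# sets, plus a bootstrap uncertainty. Not the CaloChallenge FPD implementation.
import numpy as np
from scipy import linalg

def frechet_distance(feats_ref, feats_gen):
    """Frechet distance between Gaussians fitted to (N, D) feature arrays."""
    mu1, mu2 = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_ref, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # discard tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def bootstrap_fd(feats_ref, feats_gen, n_boot=50, seed=0):
    """Mean and standard deviation of the distance over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_boot):
        i = rng.integers(0, len(feats_ref), len(feats_ref))
        j = rng.integers(0, len(feats_gen), len(feats_gen))
        vals.append(frechet_distance(feats_ref[i], feats_gen[j]))
    return float(np.mean(vals)), float(np.std(vals))

# Toy usage with stand-in high-level features (e.g. per-layer energies):
# ref = np.random.randn(5000, 8); gen = 1.05 * np.random.randn(5000, 8)
# fd_mean, fd_std = bootstrap_fd(ref, gen)
```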
Circularity Check
No circularity: empirical results on external metrics
full rationale
The paper trains a DiT-style model via conditional flow matching on voxel data and reports performance via FPD, high-level shower observables, and ResNet classifier accuracy. These evaluation metrics are independent of the training loss and are not redefined or fitted within the claimed derivation. No equations, self-citations, or ansatzes reduce the reported improvements (e.g., x-prediction gains on Dataset 3) to the inputs by construction. The large-patch choice is dictated by voxel count (40500), not by a self-referential uniqueness theorem. The central claim therefore remains an experimental outcome rather than a tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] S. Agostinelli, J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce et al., Geant4—a simulation toolkit, Nucl. Instrum. Meth. A 506 (2003) 250
- [2] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso et al., Recent developments in Geant4, Nucl. Instrum. Meth. A 835 (2016) 186
- [3] M. Wiehe, The CMS high granularity calorimeter for the High Luminosity LHC, Nucl. Instrum. Meth. A 1041 (2022) 167312
- [4] ALICE Collaboration, Technical Design Report of the ALICE Forward Calorimeter (FoCal), Tech. Rep. CERN-LHCC-2024-004, ALICE-TDR-022, CERN, Geneva (2024)
- [5]
- [6] ILD Collaboration, International Large Detector: Interim Design Report, Tech. Rep. DESY-20-034, KEK-2019-57, AIDA-2020-NOTE-2020-004, DESY, Hamburg (2020), DOI
- [7]
- [8] E. Buhmann, S. Diefenbacher, E. Eren, F. Gaede, G. Kasieczka, A. Korol et al., CaloClouds: Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 18 (2023) P11025
- [9] E. Buhmann, F. Gaede, G. Kasieczka, A. Korol, W. Korcari, K. Krüger et al., CaloClouds II: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 19 (2024) P04020
- [10] T. Buss, H. Day-Hall, F. Gaede, G. Kasieczka, K. Krüger, A. Korol et al., CaloClouds3: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 21 (2026) P03018
- [11] J. Birk, F. Gaede, A. Hallin, G. Kasieczka, M. Mozzanica and H. Rose, OmniJet-αC: learning point cloud calorimeter simulations using generative transformers, J. Instrum. 20 (2025) P07007
- [12] C. Krause, M.F. Giannelli, G. Kasieczka, B. Nachman, D. Salamani, D. Shih et al., CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation, Rep. Prog. Phys. 88 (2025) 116201
- [13] O. Ronneberger, P. Fischer and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, 2015, DOI
- [14] V. Mikuni and B. Nachman, Score-based generative models for calorimeter shower simulation, Phys. Rev. D 106 (2022) 092009
- [15] V. Mikuni and B. Nachman, CaloScore v2: Single-shot Calorimeter Shower Simulation with Diffusion Models, J. Instrum. 19 (2024) P02001
- [16] O. Amram and K. Pedro, Denoising diffusion models with geometry adaptation for high fidelity calorimeter simulation, Phys. Rev. D 108 (2023) 072014
- [17]
- [18] L. Favaro, A. Giammanco and C. Krause, "A universal vision transformer for fast calorimeter simulations," 2026, arXiv:2601.05289
- [19] T. Madula and V.M. Mikuni, CaloLatent: Score-based Generative Modelling in the Latent Space for Calorimeter Shower Generation, in Machine Learning and the Physical Sciences Workshop, NeurIPS, 2023, https://nips.cc/virtual/2023/76094
- [20] Q. Liu, C. Shimmin, X. Liu, E. Shlizerman, S. Li and S.-C. Hsu, "Calo-VQ: Vector-Quantized Two-Stage Generative Model in Calorimeter Simulation," 2024, arXiv:2405.06605
- [21] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in International Conference on Learning Representations, 2015, arXiv:1409.1556
- [22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov et al., DINOv2: Learning Robust Visual Features without Supervision, Trans. Mach. Learn. Res. (2024), arXiv:2304.07193
- [23] T. Li and K. He, "Back to Basics: Let Denoising Generative Models Denoise," 2025, arXiv:2511.13720
- [24] W. Peebles and S. Xie, Scalable Diffusion Models with Transformers, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023, DOI
- [25] J. Ho, A. Jain and P. Abbeel, Denoising Diffusion Probabilistic Models, in Advances in Neural Information Processing Systems, vol. 33, 2020, arXiv:2006.11239
- [26] J. Song, C. Meng and S. Ermon, Denoising Diffusion Implicit Models, in International Conference on Learning Representations, 2021, arXiv:2010.02502
- [27] Y. Song, J. Sohl-Dickstein, D.P. Kingma, A. Kumar, S. Ermon and B. Poole, Score-Based Generative Modeling through Stochastic Differential Equations, in International Conference on Learning Representations, 2021, arXiv:2011.13456
- [28] T. Salimans and J. Ho, Progressive Distillation for Fast Sampling of Diffusion Models, in International Conference on Learning Representations, 2022, arXiv:2202.00512
- [29] Y. Lipman, R.T.Q. Chen, H. Ben-Hamu, M. Nickel and M. Le, Flow Matching for Generative Modeling, in International Conference on Learning Representations, 2023, arXiv:2210.02747
- [30] X. Liu, C. Gong and Q. Liu, Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, in International Conference on Learning Representations, 2023, arXiv:2209.03003
- [31] M.S. Albergo and E. Vanden-Eijnden, Building Normalizing Flows with Stochastic Interpolants, in International Conference on Learning Representations, 2023, arXiv:2209.15571
- [32] O. Chapelle, B. Schölkopf and A. Zien, eds., Semi-Supervised Learning, MIT Press (2006), 10.7551/mitpress/9780262033589.001.0001
- [33] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2009) 255
- [34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P.-A. Manzagol, Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, J. Mach. Learn. Res. 11 (2010) 3371
- [35] R. Rombach, A. Blattmann, D. Lorenz, P. Esser and B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022, arXiv:2112.10752
- [36] P. Dhariwal and A.Q. Nichol, Diffusion Models Beat GANs on Image Synthesis, in Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021, arXiv:2105.05233
- [37] K. Crowson, S.A. Baumann, A. Birch, T.M. Abraham, D.Z. Kaplan and E. Shippole, "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers," 2024, arXiv:2401.11605
- [38] Y. Yu, W. Xiong, W. Nie, Y. Sheng, S. Liu and J. Luo, "PixelDiT: Pixel Diffusion Transformers for Image Generation," 2025, arXiv:2511.20645
- [39] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, in Proceedings of the 41st International Conference on Machine Learning, vol. 235 of Proceedings of Machine Learning Research, pp. 12606–12633, 2024, arXiv:2403.03206
- [40] P. Raikwar, A. Zaborowska, P. McKeown, R. Cardoso, M. Piorczynski and K. Yeo, "A Generalisable Generative Model for Multi-Detector Calorimeter Simulation," 2025, arXiv:2509.07700
- [41]
- [42] J. Chen, C. Ge, E. Xie et al., PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, in ICLR, 2024, arXiv:2310.00426
- [43] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen and Y. Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding, Neurocomputing 568 (2024) 127063
- [44]
- [45] N. Shazeer, "GLU Variants Improve Transformer," 2020, arXiv:2002.05202
- [46] B. Zhang and R. Sennrich, Root Mean Square Layer Normalization, in Advances in Neural Information Processing Systems, vol. 32, 2019, arXiv:1910.07467
- [47] A. Henry, P.R. Dachapally, S.S. Pawar and Y. Chen, Query-Key Normalization for Transformers, in Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253, 2020, DOI
- [48] C. Krause and D. Shih, CaloFlow: Fast and Accurate Generation of Calorimeter Showers with Normalizing Flows, Phys. Rev. D 107 (2023) 113003
- [49] T. Buss, F. Gaede, G. Kasieczka, C. Krause and D. Shih, Convolutional L2LFlows: generating accurate showers in highly granular calorimeters using convolutional normalizing flows, J. Instrum. 19 (2024) P09003
- [50]
- [51]
- [52] M. Bińkowski, D.J. Sutherland, M. Arbel and A. Gretton, Demystifying MMD GANs, in International Conference on Learning Representations, 2018, arXiv:1801.01401
- [53]
- [54] "facebookresearch/DiT issue #14," 2023, https://github.com/facebookresearch/DiT/issues/14
- [55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in International Conference on Learning Representations, 2021, arXiv:2010.11929
- [56] N. Ma, M. Goldstein, M.S. Albergo, N.M. Boffi, E. Vanden-Eijnden and S. Xie, SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers, in Computer Vision – ECCV 2024, pp. 23–40, 2024, DOI
- [57]
- [58] V. Sovrasov, "ptflops: Flops Counter for Neural Networks in PyTorch," 2024, https://github.com/sovrasov/flops-counter.pytorch
- [59] T. Karras, M. Aittala, T. Aila and S. Laine, Elucidating the Design Space of Diffusion-Based Generative Models, in Advances in Neural Information Processing Systems, vol. 35, pp. 26565–26577, 2022, arXiv:2206.00364
- [60] J.C. Butcher, Numerical Methods for Ordinary Differential Equations, John Wiley & Sons, 2 ed. (2008), 10.1002/9780470753767
- [61]
- [62] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12 of International Series in Operations Research & Management Science, Kluwer Academic Publishers (1999), 10.1007/978-1-4615-5563-6
- [63]
- [64] J. Johnson, A. Alahi and L. Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, in European Conference on Computer Vision, pp. 694–711, 2016, DOI
- [65] R. Zhang, P. Isola, A.A. Efros, E. Shechtman and O. Wang, The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018, arXiv:1801.03924
- [66] S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin et al., Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think, in International Conference on Learning Representations, 2025, arXiv:2410.06940
- [67] A. Gupta, L. Yu et al., Photorealistic video generation with diffusion models, in ECCV, 2024, arXiv:2312.06662
- [68] C. Chen, R. Qian, W. Hu et al., "DiT-Air: Revisiting the efficiency of diffusion model architecture design in text to image generation," 2025, arXiv:2503.10618
- [69] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller et al., SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, in International Conference on Learning Representations, 2024, arXiv:2307.01952