pith. machine review for the scientific record.

arxiv: 2605.12011 · v1 · submitted 2026-05-12 · ⚛️ physics.ins-det · hep-ex · hep-ph

Recognition: no theorem link

CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

Gongxing Sun, Zhengkun Huang

Pith reviewed 2026-05-13 04:35 UTC · model grok-4.3

classification ⚛️ physics.ins-det · hep-ex · hep-ph

keywords calorimeter shower simulation · diffusion transformer · x-prediction · flow matching · voxel generation · high-granularity detector · generative modeling

The pith

Large-patch x-prediction in voxel-space diffusion transformers generates high-granularity calorimeter showers efficiently without a latent tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a diffusion transformer that tokenizes the full voxel grid of a calorimeter into large patches and uses x-prediction during training. Applied to two high-dimensional shower datasets, this setup produces synthetic events that match or surpass prior methods on standard quality measures while keeping generation times around ten milliseconds on a single GPU. The approach is presented as a direct route to raw voxel generation that avoids training a separate tokenizer. A reader would care because calorimeter simulation is a major computational bottleneck in particle physics, and any method that lowers the cost of producing faithful high-resolution showers could speed up experiment design and data analysis.

Core claim

CaloArt is a DiT-style backbone augmented with three-dimensional positional encodings and trained by conditional flow matching in which the prediction target and the loss space are decoupled. On the smaller-patch Dataset 2 the model records the best Fréchet particle distance, the strongest high-level metrics, and the strongest classifier metrics among reported methods; on the 40500-voxel Dataset 3 the same architecture with large patches and x-prediction improves every metric relative to v-prediction and sits on the quality-versus-generation-time frontier. Both variants complete a shower in roughly 10 ms.

What carries the argument

Large-patch tokenization of the raw voxel grid inside a diffusion transformer that uses x-prediction under conditional flow matching.
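The decoupling of prediction target and loss space can be made concrete. Under one common linear flow-matching convention (assumed here; the review does not quote the paper's exact parameterization), the noisy sample is x_t = (1 - t) * x0 + t * eps with target velocity v = eps - x0. A network that outputs an estimate of x0 still determines an implied velocity v_hat = (x_t - x0_hat) / t, so the loss can be computed in velocity space while the prediction stays in data space. A minimal numeric sketch with scalar "voxels":

```python
# Sketch of x-prediction under linear flow matching (convention assumed,
# not taken from the paper): x_t = (1 - t) * x0 + t * eps, v = eps - x0.
import random

def interpolate(x0, eps, t):
    """Noisy sample on the straight path from data x0 to noise eps."""
    return [(1 - t) * a + t * b for a, b in zip(x0, eps)]

def implied_velocity(x_t, x0_pred, t):
    """Convert an x-prediction into a velocity for a velocity-space loss (t > 0)."""
    return [(xt - x0) / t for xt, x0 in zip(x_t, x0_pred)]

random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(5)]    # stand-in "shower" voxels
eps = [random.gauss(0, 1) for _ in range(5)]   # noise endpoint
t = 0.7
x_t = interpolate(x0, eps, t)

# Sanity check: a perfect x-prediction recovers the true velocity exactly,
# since x_t - x0 = t * (eps - x0).
v_hat = implied_velocity(x_t, x0, t)
v_true = [b - a for a, b in zip(x0, eps)]
assert all(abs(u - w) < 1e-9 for u, w in zip(v_hat, v_true))
```

The identity v_hat = (x_t - x0_hat) / t is what lets the training objective live in one space while the network's output lives in another.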

If this is right

  • On grids small enough for fine patches, x-prediction remains competitive with v-prediction while simplifying the training objective.
  • On grids large enough that only coarse patches are feasible, x-prediction demonstrably raises all evaluated fidelity scores.
  • Single-GPU inference stays below twelve milliseconds per shower for both tested resolutions.
  • The method removes the requirement to train and maintain a separate latent-space tokenizer.
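The large-patch feasibility point above is, at bottom, token-count arithmetic: attention cost grows with the square of the sequence length, so patch size decides what a transformer can afford on 40500 voxels. A sketch, assuming a 45 × 50 × 18 grid layout for Dataset 3 and illustrative patch shapes that are not taken from the paper:

```python
# Token-count arithmetic for patchifying a 3D voxel grid.
# Grid layout and patch shapes below are assumptions for illustration.
from math import prod

def token_count(grid, patch):
    """Number of non-overlapping patch tokens; the patch must tile the grid."""
    assert all(g % p == 0 for g, p in zip(grid, patch)), "patch must tile the grid"
    return prod(g // p for g, p in zip(grid, patch))

grid = (45, 50, 18)          # assumed Dataset 3 layout: layers x angular x radial
assert prod(grid) == 40500   # matches the voxel count quoted for Dataset 3

for patch in [(1, 1, 1), (3, 5, 3), (5, 10, 6)]:
    n = token_count(grid, patch)
    print(f"patch {patch}: {n} tokens, attention pairs ~ {n * n:,}")
```

Going from per-voxel tokens to a coarse patch shrinks the sequence by orders of magnitude, which is the whole feasibility argument for raw voxel generation without a latent tokenizer.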

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same large-patch x-prediction recipe could be tested on other high-dimensional physics simulation tasks such as tracking or detector response in different geometries.
  • If the reported metrics continue to correlate with downstream physics analyses, the approach would directly reduce the wall-clock time needed for Monte Carlo campaigns in current and future collider experiments.
  • Extending the architecture to conditional generation on additional variables such as incident particle energy or angle would be a natural next measurement of its flexibility.

Load-bearing premise

The chosen metrics of Fréchet particle distance, high-level observables, and ResNet classifier accuracy are sufficient to certify that the generated showers reproduce the full physics content of real data across all relevant phase-space regions.

What would settle it

A direct comparison testing whether showers produced by the model deviate from full Geant4 simulation in shower-shape moments, energy-flow correlations, or rare topologies that are invisible to the reported FPD and classifier scores.
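For orientation on the FPD score that such a test would have to go beyond: Fréchet-type distances compare Gaussian fits of feature distributions, so they are blind by construction to anything the fitted moments miss. In one dimension the distance has a closed form, d² = (μ₁ − μ₂)² + (σ₁ − σ₂)². The sketch below is a deliberately simplified scalar stand-in (the real FPD of Kansal et al. uses multivariate physics features and reports uncertainties) with entirely synthetic sample data:

```python
# Simplified 1D Fréchet-Gaussian distance; all data here is synthetic.
import random
import statistics

def frechet_gaussian_1d(xs, ys):
    """Squared Frechet distance between 1D Gaussian fits of two samples:
    d^2 = (mu_x - mu_y)^2 + (sigma_x - sigma_y)^2."""
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    my, sy = statistics.fmean(ys), statistics.pstdev(ys)
    return (mx - my) ** 2 + (sx - sy) ** 2

random.seed(1)
geant = [random.gauss(10.0, 2.0) for _ in range(5000)]  # stand-in for reference showers
model = [random.gauss(10.1, 2.0) for _ in range(5000)]  # stand-in for generated showers

d2 = frechet_gaussian_1d(geant, model)
assert frechet_gaussian_1d(geant, geant) == 0.0  # identical samples score zero
assert d2 < 0.1                                  # near-identical Gaussians score near zero
```

Two distributions with matching means and widths but different tails or correlations would score near zero here, which is exactly the blind spot the settling test targets.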

read the original abstract

High-granularity calorimeters make ML-based fast shower simulation a high-dimensional generative modeling problem, where voxel-space generators must balance physics fidelity with training and inference cost. This work studies large-patch tokenization with x-prediction, enabling efficient raw voxel generation. We propose CaloArt, a modernized DiT-style backbone with 3D positional encoding and architectural refinements, trained via conditional flow matching with decoupled prediction and loss spaces. On CaloChallenge Dataset 2, where small patch size remains affordable, v-prediction performs well, and CaloArt achieves the best FPD, strongest high-level metrics, and strongest ResNet classifier metrics. On CaloChallenge Dataset 3, the 40500-voxel grid makes large patches necessary; x-prediction improves all reported metrics over v-prediction and places CaloArt on the quality-generation-time Pareto frontier. The final CCD2 and CCD3 models both retain O(10) ms single-GPU generation time, with 9.71 and 11.14 ms per shower. These results support large-patch voxel-space diffusion transformers with x-prediction as a compute-efficient route to high-granularity calorimeter shower synthesis, reducing training and inference cost without a pretrained latent tokenizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces CaloArt, a DiT-style diffusion transformer using large-patch tokenization and x-prediction for raw voxel-space generation of high-granularity calorimeter showers. Trained with conditional flow matching, it reports best-in-class FPD, high-level observables, and ResNet classifier accuracy on CaloChallenge Datasets 2 and 3, with single-GPU generation times of 9.71 ms and 11.14 ms per shower. The central claim is that this large-patch x-prediction approach offers a compute-efficient route to high-fidelity synthesis without a pretrained latent tokenizer.

Significance. If the results hold under scrutiny, the work could meaningfully advance fast shower simulation for high-energy physics, where high-granularity calorimeters demand scalable generative models to replace costly Geant4 runs. Credit is due for explicit reporting of inference times, use of public benchmark datasets, and direct voxel-space operation that avoids latent tokenizer overhead. The architectural refinements (3D positional encoding, decoupled prediction/loss spaces) are clearly motivated. Significance is limited by the empirical nature of the claims and the need for stronger validation that the chosen proxies capture physics-relevant features across the full shower phase space.

major comments (3)
  1. [Abstract] Abstract and results on Dataset 3: the claim that x-prediction improves all reported metrics over v-prediction lacks an ablation isolating its effect from other changes (3D positional encoding, architectural refinements). This attribution is load-bearing for the central claim that x-prediction is the key enabler of the compute-efficient route.
  2. [Results] Results section (Dataset 3, 40500-voxel grid): no quantitative error bars, uncertainty estimates, or statistical significance tests accompany the FPD, high-level observables, or ResNet accuracy figures, despite these being the basis for declaring best-in-class performance and Pareto-frontier status.
  3. [Evaluation] Evaluation and discussion: the sufficiency of FPD, high-level observables, and ResNet classifier accuracy as proxies for physics fidelity is not demonstrated, particularly for localized energy depositions or long-range correlations that may be missed in the large-patch regime on Dataset 3 where small-patch or latent baselines cannot be directly compared.
minor comments (3)
  1. [Methods] Methods section: a brief equation or diagram contrasting x-prediction versus v-prediction (and their loss spaces) would aid readers unfamiliar with the diffusion-model variants.
  2. [Figures] Figures and tables: captions should explicitly state the patch sizes, number of tokens, and exact baseline configurations used for each dataset to support reproducibility.
  3. [Introduction] Introduction: ensure all cited prior calorimeter simulation works (including non-diffusion baselines) are referenced to contextualize the claimed improvements.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We address each of the major comments below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results on Dataset 3: the claim that x-prediction improves all reported metrics over v-prediction lacks an ablation isolating its effect from other changes (3D positional encoding, architectural refinements). This attribution is load-bearing for the central claim that x-prediction is the key enabler of the compute-efficient route.

    Authors: The x- versus v-prediction comparison is performed using the identical CaloArt architecture on Dataset 3, with the only variation being the prediction target. This setup isolates the contribution of x-prediction from the other architectural elements. We will revise the abstract and results section to make this isolation explicit and add a note clarifying the controlled nature of the comparison. revision: partial

  2. Referee: [Results] Results section (Dataset 3, 40500-voxel grid): no quantitative error bars, uncertainty estimates, or statistical significance tests accompany the FPD, high-level observables, or ResNet accuracy figures, despite these being the basis for declaring best-in-class performance and Pareto-frontier status.

    Authors: We concur that uncertainty quantification is important for substantiating the performance claims. The revised version will incorporate error bars on all quantitative metrics, calculated from multiple runs with varied seeds. Statistical tests will be included to assess the significance of improvements over baselines. revision: yes

  3. Referee: [Evaluation] Evaluation and discussion: the sufficiency of FPD, high-level observables, and ResNet classifier accuracy as proxies for physics fidelity is not demonstrated, particularly for localized energy depositions or long-range correlations that may be missed in the large-patch regime on Dataset 3 where small-patch or latent baselines cannot be directly compared.

    Authors: FPD, high-level observables, and classifier-based metrics are the standard evaluation suite for the CaloChallenge and have been validated in the community for assessing shower fidelity. Our model outperforms or matches competitors on these for the high-granularity Dataset 3. We will enhance the discussion to note the limitations of these proxies regarding fine-scale features and long-range effects in the large-patch setting, while emphasizing that the chosen approach enables practical generation speeds without latent spaces. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results on external metrics

full rationale

The paper trains a DiT-style model via conditional flow matching on voxel data and reports performance via FPD, high-level shower observables, and ResNet classifier accuracy. These evaluation metrics are independent of the training loss and are not redefined or fitted within the claimed derivation. No equations, self-citations, or ansatzes reduce the reported improvements (e.g., x-prediction gains on Dataset 3) to the inputs by construction. The large-patch choice is dictated by voxel count (40500), not by a self-referential uniqueness theorem. The central claim therefore remains an experimental outcome rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that standard diffusion transformer components plus the listed refinements are sufficient; no new physical axioms or invented particles are introduced.

pith-pipeline@v0.9.0 · 5528 in / 1225 out tokens · 41269 ms · 2026-05-13T04:35:11.953406+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 19 internal anchors

  1. [1] S. Agostinelli, J. Allison, K. Amako, J. Apostolakis, H. Araujo, P. Arce et al., Geant4—a simulation toolkit, Nucl. Instrum. Meth. A 506 (2003) 250
  2. [2] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso et al., Recent developments in Geant4, Nucl. Instrum. Meth. A 835 (2016) 186
  3. [3] M. Wiehe, The CMS high granularity calorimeter for the High Luminosity LHC, Nucl. Instrum. Meth. A 1041 (2022) 167312
  4. [4] ALICE Collaboration, Technical Design Report of the ALICE Forward Calorimeter (FoCal), Tech. Rep. CERN-LHCC-2024-004, ALICE-TDR-022, CERN, Geneva (2024)
  5. [5] F. Sefkow, A. White, K. Kawagoe, R. Pöschl and J. Repond, Experimental tests of particle flow calorimetry, Rev. Mod. Phys. 88 (2016) 015003
  6. [6] ILD Collaboration, International Large Detector: Interim Design Report, Tech. Rep. DESY-20-034, KEK-2019-57, AIDA-2020-NOTE-2020-004, DESY, Hamburg (2020)
  7. [7] M. Aleksa, F. Bedeschi, R. Ferrari, F. Sefkow and C.G. Tully, Calorimetry at FCC-ee, Eur. Phys. J. Plus 136 (2021) 1066
  8. [8] E. Buhmann, S. Diefenbacher, E. Eren, F. Gaede, G. Kasieczka, A. Korol et al., CaloClouds: Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 18 (2023) P11025
  9. [9] E. Buhmann, F. Gaede, G. Kasieczka, A. Korol, W. Korcari, K. Krüger et al., CaloClouds II: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 19 (2024) P04020
  10. [10] T. Buss, H. Day-Hall, F. Gaede, G. Kasieczka, K. Krüger, A. Korol et al., CaloClouds3: Ultra-Fast Geometry-Independent Highly-Granular Calorimeter Simulation, J. Instrum. 21 (2026) P03018
  11. [11] J. Birk, F. Gaede, A. Hallin, G. Kasieczka, M. Mozzanica and H. Rose, OmniJet-αC: learning point cloud calorimeter simulations using generative transformers, J. Instrum. 20 (2025) P07007
  12. [12] C. Krause, M.F. Giannelli, G. Kasieczka, B. Nachman, D. Salamani, D. Shih et al., CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation, Rep. Prog. Phys. 88 (2025) 116201
  13. [13] O. Ronneberger, P. Fischer and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, 2015
  14. [14] V. Mikuni and B. Nachman, Score-based generative models for calorimeter shower simulation, Phys. Rev. D 106 (2022) 092009
  15. [15] V. Mikuni and B. Nachman, CaloScore v2: Single-shot Calorimeter Shower Simulation with Diffusion Models, J. Instrum. 19 (2024) P02001
  16. [16] O. Amram and K. Pedro, Denoising diffusion models with geometry adaptation for high fidelity calorimeter simulation, Phys. Rev. D 108 (2023) 072014
  17. [17] L. Favaro, A. Ore, S.P. Schweitzer and T. Plehn, CaloDREAM – Detector Response Emulation via Attentive flow Matching, SciPost Phys. 18 (2025) 088
  18. [18] L. Favaro, A. Giammanco and C. Krause, "A universal vision transformer for fast calorimeter simulations." 2026, arXiv:2601.05289
  19. [19] T. Madula and V.M. Mikuni, CaloLatent: Score-based Generative Modelling in the Latent Space for Calorimeter Shower Generation, in Machine Learning and the Physical Sciences Workshop, NeurIPS, 2023, https://nips.cc/virtual/2023/76094
  20. [20] Q. Liu, C. Shimmin, X. Liu, E. Shlizerman, S. Li and S.-C. Hsu, "Calo-VQ: Vector-Quantized Two-Stage Generative Model in Calorimeter Simulation." 2024, arXiv:2405.06605
  21. [21] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in International Conference on Learning Representations, 2015, arXiv:1409.1556
  22. [22] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov et al., DINOv2: Learning Robust Visual Features without Supervision, Trans. Mach. Learn. Res. (2024), arXiv:2304.07193
  23. [23] T. Li and K. He, "Back to Basics: Let Denoising Generative Models Denoise." 2025, arXiv:2511.13720
  24. [24] W. Peebles and S. Xie, Scalable Diffusion Models with Transformers, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023
  25. [25] J. Ho, A. Jain and P. Abbeel, Denoising Diffusion Probabilistic Models, in Advances in Neural Information Processing Systems, vol. 33, 2020, arXiv:2006.11239
  26. [26] J. Song, C. Meng and S. Ermon, Denoising Diffusion Implicit Models, in International Conference on Learning Representations, 2021, arXiv:2010.02502
  27. [27] Y. Song, J. Sohl-Dickstein, D.P. Kingma, A. Kumar, S. Ermon and B. Poole, Score-Based Generative Modeling through Stochastic Differential Equations, in International Conference on Learning Representations, 2021, arXiv:2011.13456
  28. [28] T. Salimans and J. Ho, Progressive Distillation for Fast Sampling of Diffusion Models, in International Conference on Learning Representations, 2022, arXiv:2202.00512
  29. [29] Y. Lipman, R.T.Q. Chen, H. Ben-Hamu, M. Nickel and M. Le, Flow Matching for Generative Modeling, in International Conference on Learning Representations, 2023, arXiv:2210.02747
  30. [30] X. Liu, C. Gong and Q. Liu, Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, in International Conference on Learning Representations, 2023, arXiv:2209.03003
  31. [31] M.S. Albergo and E. Vanden-Eijnden, Building Normalizing Flows with Stochastic Interpolants, in International Conference on Learning Representations, 2023, arXiv:2209.15571
  32. [32] O. Chapelle, B. Schölkopf and A. Zien, eds., Semi-Supervised Learning, MIT Press (2006), 10.7551/mitpress/9780262033589.001.0001
  33. [33] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. 46 (2009) 255
  34. [34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P.-A. Manzagol, Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, J. Mach. Learn. Res. 11 (2010) 3371
  35. [35] R. Rombach, A. Blattmann, D. Lorenz, P. Esser and B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022, arXiv:2112.10752
  36. [36] P. Dhariwal and A.Q. Nichol, Diffusion Models Beat GANs on Image Synthesis, in Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021, arXiv:2105.05233
  37. [37] K. Crowson, S.A. Baumann, A. Birch, T.M. Abraham, D.Z. Kaplan and E. Shippole, "Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers." 2024, arXiv:2401.11605
  38. [38] Y. Yu, W. Xiong, W. Nie, Y. Sheng, S. Liu and J. Luo, "PixelDiT: Pixel Diffusion Transformers for Image Generation." 2025, arXiv:2511.20645
  39. [39] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, in Proceedings of the 41st International Conference on Machine Learning, vol. 235 of Proceedings of Machine Learning Research, pp. 12606–12633, 2024, arXiv:2403.03206
  40. [40] P. Raikwar, A. Zaborowska, P. McKeown, R. Cardoso, M. Piorczynski and K. Yeo, "A Generalisable Generative Model for Multi-Detector Calorimeter Simulation." 2025, arXiv:2509.07700
  41. [41] L. Favaro, A. Giammanco and C. Krause, Fast, accurate, and precise detector simulation with vision transformers, in 2nd European AI for Fundamental Physics Conference, 2025, arXiv:2509.25169
  42. [42] J. Chen, C. Ge, E. Xie et al., Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, in ICLR, 2024, arXiv:2310.00426
  43. [43] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen and Y. Liu, RoFormer: Enhanced Transformer with Rotary Position Embedding, Neurocomputing 568 (2024) 127063
  44. [44] J. Yao, B. Yang and X. Wang, Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models, in CVPR, 2025, arXiv:2501.01423
  45. [45] N. Shazeer, "GLU Variants Improve Transformer." 2020, arXiv:2002.05202
  46. [46] B. Zhang and R. Sennrich, Root Mean Square Layer Normalization, in Advances in Neural Information Processing Systems, vol. 32, 2019, arXiv:1910.07467
  47. [47] A. Henry, P.R. Dachapally, S.S. Pawar and Y. Chen, Query-Key Normalization for Transformers, in Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253, 2020
  48. [48] C. Krause and D. Shih, CaloFlow: Fast and Accurate Generation of Calorimeter Showers with Normalizing Flows, Phys. Rev. D 107 (2023) 113003
  49. [49] T. Buss, F. Gaede, G. Kasieczka, C. Krause and D. Shih, Convolutional L2LFlows: generating accurate showers in highly granular calorimeters using convolutional normalizing flows, J. Instrum. 19 (2024) P09003
  50. [50] R. Kansal, A. Li, J. Duarte, N. Chernyavskaya, M. Pierini, B. Orzari et al., Evaluating generative models in high energy physics, Phys. Rev. D 107 (2023) 076017
  51. [51] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler and S. Hochreiter, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, in Advances in Neural Information Processing Systems, vol. 30, 2017, arXiv:1706.08500
  52. [52] M. Bińkowski, D.J. Sutherland, M. Arbel and A. Gretton, Demystifying MMD GANs, in International Conference on Learning Representations, 2018, arXiv:1801.01401
  53. [53] R. Kansal, C. Pareja, Z. Hao and J. Duarte, JetNet: A Python package for accessing open datasets and benchmarking machine learning methods in high energy physics, J. Open Source Softw. 8 (2023) 5789
  54. [54] "facebookresearch/DiT issue #14." 2023, https://github.com/facebookresearch/DiT/issues/14
  55. [55] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in International Conference on Learning Representations, 2021, arXiv:2010.11929
  56. [56] N. Ma, M. Goldstein, M.S. Albergo, N.M. Boffi, E. Vanden-Eijnden and S. Xie, SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers, in Computer Vision – ECCV 2024, pp. 23–40, 2024
  57. [57] J. Rasley, S. Rajbhandari, O. Ruwase and Y. He, DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506, 2020
  58. [58] V. Sovrasov, "ptflops: Flops Counter for Neural Networks in PyTorch." 2024, https://github.com/sovrasov/flops-counter.pytorch
  59. [59] T. Karras, M. Aittala, T. Aila and S. Laine, Elucidating the Design Space of Diffusion-Based Generative Models, in Advances in Neural Information Processing Systems, vol. 35, pp. 26565–26577, 2022, arXiv:2206.00364
  60. [60] J.C. Butcher, Numerical Methods for Ordinary Differential Equations, John Wiley & Sons, 2 ed. (2008), 10.1002/9780470753767
  61. [61] T. Li, Y. Tian, H. Li, M. Deng and K. He, Autoregressive Image Generation without Vector Quantization, in Advances in Neural Information Processing Systems, vol. 37, 2024, arXiv:2406.11838
  62. [62] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12 of International Series in Operations Research & Management Science, Kluwer Academic Publishers (1999), 10.1007/978-1-4615-5563-6
  63. [63] P. Esser, R. Rombach and B. Ommer, Taming Transformers for High-Resolution Image Synthesis, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883, 2021, arXiv:2012.09841
  64. [64] J. Johnson, A. Alahi and L. Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, in European Conference on Computer Vision, pp. 694–711, 2016
  65. [65] R. Zhang, P. Isola, A.A. Efros, E. Shechtman and O. Wang, The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018, arXiv:1801.03924
  66. [66] S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin et al., Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think, in International Conference on Learning Representations, 2025, arXiv:2410.06940
  67. [67] A. Gupta, L. Yu et al., Photorealistic video generation with diffusion models, in ECCV, 2024, arXiv:2312.06662
  68. [68] C. Chen, R. Qian, W. Hu et al., "Dit-air: Revisiting the efficiency of diffusion model architecture design in text to image generation." 2025, arXiv:2503.10618
  69. [69] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller et al., SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, in International Conference on Learning Representations, 2024, arXiv:2307.01952