pith. machine review for the scientific record.

arxiv: 2605.02481 · v2 · submitted 2026-05-04 · ✦ hep-ex

Recognition: no theorem link

Cascade Pipeline for Leading-Order Matrix Element Evaluation on AMD Versal AI Engine Arrays

A. Oyanguren, A. Valero, C. Vico Villalba, F. Carrió, F. Hervás Álvarez, H. Gutiérrez Arance, J. Fernández Menéndez, L. Fiorini, P. Leguina López, S. Folgueras

Pith reviewed 2026-05-12 00:46 UTC · model grok-4.3

classification ✦ hep-ex
keywords matrix element evaluation · leading order · AI Engine · cascade pipeline · high energy physics · throughput · energy efficiency · Versal platform

The pith

A five-stage cascade pipeline on AMD Versal AI Engine arrays can run 80 parallel instances to deliver one million leading-order matrix element evaluations per second.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the bottleneck of repeatedly evaluating matrix elements at many phase-space points when generating events for particle collisions. It decomposes the leading-order calculation for the gg→ttg process into five sequential stages that communicate through a token-passing protocol on the on-chip cascade links. This decomposition fits within the tight memory limits of each AI Engine tile and allows many independent pipelines to operate in parallel across the array. The resulting design is presented as an energy-efficient alternative to conventional CPUs for the higher data rates expected at the High-Luminosity LHC. A reader would care because it shows a concrete way to move part of the simulation workload onto specialized hardware while keeping numerical results close to existing reference codes.
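
To make the decomposition concrete, here is a minimal host-side C++ sketch of a five-stage pipeline whose stages hand a compact token from one to the next. The stage contents, token fields, and arithmetic are illustrative assumptions, not the authors' kernel code; on the device each stage would occupy its own tile, with tokens travelling over the cascade links and all five stages working concurrently on different phase-space points.

    // Conceptual sketch only: five toy stages standing in for the real
    // amplitude sub-computations, chained by a compact "wavefunction token".
    #include <array>
    #include <complex>
    #include <cstdio>

    struct Token {
        std::array<std::complex<double>, 4> wf{};  // partial wavefunction data (placeholder)
        double amp2 = 0.0;                         // accumulated |M|^2 contribution
    };

    Token stage1(Token t) { for (auto& w : t.wf) w += std::complex<double>(1.0, 0.5); return t; }
    Token stage2(Token t) { for (auto& w : t.wf) w *= std::complex<double>(0.0, 1.0); return t; }
    Token stage3(Token t) { t.wf[0] += t.wf[1] * t.wf[2]; return t; }
    Token stage4(Token t) { t.wf[3] += std::conj(t.wf[0]) * t.wf[2]; return t; }
    Token stage5(Token t) { for (const auto& w : t.wf) t.amp2 += std::norm(w); return t; }

    double evaluate(Token seed) {
        // Sequential composition here; on hardware the five stages run in
        // parallel on adjacent tiles, forming a streaming pipeline.
        return stage5(stage4(stage3(stage2(stage1(seed))))).amp2;
    }

    int main() {
        std::printf("toy |M|^2 = %.6f\n", evaluate(Token{}));
    }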

Core claim

The leading-order matrix element for the gg→ttg process is split into a five-stage pipeline because each AI Engine tile has only 16 kB of program memory. Stages exchange intermediate wavefunction information using a token protocol over the cascade interface. When 80 such independent pipelines are placed on the 400 tiles of the VCK190 device, the projected performance reaches 1.0×10^6 evaluations per second at 54.8 W. This corresponds to a 34× speedup relative to a single CPU core and a 7.7× gain in energy efficiency. The computed values agree with the MadGraph double-precision reference to a mean relative error of parts per million.
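
The headline figures are related by simple arithmetic. The sketch below only reproduces that arithmetic from the quoted numbers (tile count, stage count, total rate, power, speedup factor) and derives nothing that is not already implied by them.

    // Back-of-envelope reproduction of the projection arithmetic in the claim.
    #include <cstdio>

    int main() {
        const int    tiles           = 400;     // AIE tiles on the VCK190
        const int    stages          = 5;       // stages per pipeline
        const int    pipelines       = tiles / stages;          // 80 independent pipelines
        const double total_rate      = 1.0e6;   // projected evaluations per second
        const double per_pipeline    = total_rate / pipelines;  // ~12,500 evals/s each
        const double power_w         = 54.8;    // projected device power
        const double evals_per_joule = total_rate / power_w;    // ~18,200 evals/J

        // The quoted 34x speedup implies a single-core baseline of roughly
        // total_rate / 34, i.e. about 2.9e4 evaluations per second.
        const double cpu_rate = total_rate / 34.0;

        std::printf("pipelines        : %d\n", pipelines);
        std::printf("per-pipeline rate: %.0f evals/s\n", per_pipeline);
        std::printf("energy efficiency: %.0f evals/J\n", evals_per_joule);
        std::printf("implied CPU rate : %.0f evals/s\n", cpu_rate);
    }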

What carries the argument

The five-stage cascade pipeline with wavefunction-token protocol, which partitions the matrix-element arithmetic into sequential stages that pass compact data tokens across the AI Engine array.

Load-bearing premise

Splitting the calculation into five stages and passing data tokens between them adds almost no extra time or numerical error.
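
A toy illustration of why the numerical half of this premise is plausible: if the staged version performs the same floating-point operations in the same order and passes intermediates at full precision, its result is bit-identical to a fused evaluation; error can only enter if tokens are narrowed or operations are reordered. The arithmetic below is a stand-in, not the amplitude code.

    // Staged vs. fused evaluation of the same expression: identical operations,
    // identical order, full-precision hand-off => identical results.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    double fused(double x) { return std::sin(x) * std::cos(x) + x * x; }

    // Same arithmetic split across two "stages" that pass a double-precision token.
    double stage_a(double x) { return std::sin(x) * std::cos(x); }
    double stage_b(double x, double partial) { return partial + x * x; }

    int main() {
        double max_diff = 0.0;
        for (int i = 0; i < 1000; ++i) {
            double x = 0.01 * i;
            double staged = stage_b(x, stage_a(x));
            max_diff = std::max(max_diff, std::fabs(staged - fused(x)));
        }
        std::printf("max |staged - fused| = %g\n", max_diff);  // 0 when precision and order match
    }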

What would settle it

Measure the actual number of completed evaluations per second and the power consumption when the full 80-pipeline design runs on physical VCK190 hardware, and compare the output values against the reference code for thousands of independent phase-space points.
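
A sketch of the accuracy half of that test, assuming the pipeline output and the reference |M|^2 value are both available per phase-space point. The data here are placeholders, and the definition of mean relative error is the straightforward one; the paper does not spell it out.

    // Mean relative error between device outputs and reference values over a
    // set of phase-space points (placeholder data for illustration).
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    double mean_relative_error(const std::vector<double>& device_vals,
                               const std::vector<double>& reference_vals) {
        double sum = 0.0;
        for (std::size_t i = 0; i < device_vals.size(); ++i)
            sum += std::fabs(device_vals[i] - reference_vals[i]) / std::fabs(reference_vals[i]);
        return sum / static_cast<double>(device_vals.size());
    }

    int main() {
        // In practice these would be |M|^2 values for thousands of points
        // from the pipeline and from the reference code.
        std::vector<double> device_vals    = {1.0000001, 2.0000002, 0.4999999};
        std::vector<double> reference_vals = {1.0,       2.0,       0.5};
        std::printf("mean relative error = %.3g\n",
                    mean_relative_error(device_vals, reference_vals));
    }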

read the original abstract

A major computational bottleneck in modern High Energy Physics event generators arises from the integration of the matrix element, which requires repeated evaluations at different phase-space points to cover all possible initial- and final-state configurations. As the Large Hadron Collider enters its High-Luminosity phase, the demand for energy-efficient acceleration is expected to exceed the limits of conventional CPU scaling, motivating the use of highly parallel computing platforms such as graphics processing units (GPUs). In this work, we present an alternative approach based on a cascade pipeline architecture for evaluating leading-order matrix elements of the gg→ttg process on AMD Versal AI Engine (AIE) arrays. Due to the 16 kB per-tile program memory constraint, the computation is decomposed into a five-stage pipeline, with stages communicating via a wavefunction-token protocol over the on-chip cascade interface. Mapping 80 independent pipelines onto the 400 AIE tiles of the VCK190 platform yields a projected throughput of 1.0×10^6 matrix element evaluations per second at 54.8 W, corresponding to a 34× speedup over a single CPU core and a 7.7× improvement in energy efficiency. Numerical agreement with the MadGraph5_aMC@NLO double-precision reference is validated at the parts-per-million level in mean relative error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes a five-stage cascade pipeline implementation for computing leading-order matrix elements of the gg→ttg process on AMD Versal AI Engine (AIE) arrays. The approach addresses the 16 kB per-tile program memory limit by decomposing the calculation and using a wavefunction-token protocol for inter-stage communication over the on-chip cascade interface. The authors project that deploying 80 independent pipelines on the 400 AIE tiles of the VCK190 platform will deliver 1.0×10^6 matrix element evaluations per second at 54.8 W, corresponding to a 34× speedup and 7.7× energy efficiency improvement relative to a single CPU core, while maintaining parts-per-million numerical agreement with MadGraph.

Significance. If the performance projections hold under realistic workloads, this work could represent a significant step toward energy-efficient hardware acceleration for matrix element evaluations in high-energy physics simulations. The use of specialized AI Engine arrays offers an alternative to GPU-based approaches, potentially reducing power consumption for the high-volume computations required by the High-Luminosity LHC. The ppm-level numerical validation is a positive indicator of correctness, though the engineering focus on pipeline mapping is the primary contribution.

major comments (2)
  1. The throughput of 1.0×10^6 evaluations per second, 34× speedup, and 7.7× energy efficiency are presented as projections based on mapping 80 pipelines (400 tiles / 5 stages) with the assumption of negligible overhead from the cascade interface and wavefunction-token synchronization. No on-device wall-clock throughput or power measurements on the VCK190 are reported to validate this zero-overhead premise, which is load-bearing for the central performance claims.
  2. The CPU baseline used for the 34× speedup comparison is described only as 'a single CPU core' with no details on processor model, clock frequency, optimization flags, or the specific matrix-element implementation, preventing assessment of whether the comparison is fair or reproducible.
minor comments (1)
  1. The abstract states ppm-level numerical agreement but provides no information on the number or distribution of phase-space points used for validation or the precise definition of mean relative error.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to improve clarity and transparency of the performance claims.

read point-by-point responses
  1. Referee: The throughput of 1.0×10^6 evaluations per second, 34× speedup, and 7.7× energy efficiency are presented as projections based on mapping 80 pipelines (400 tiles / 5 stages) with the assumption of negligible overhead from the cascade interface and wavefunction-token synchronization. No on-device wall-clock throughput or power measurements on the VCK190 are reported to validate this zero-overhead premise, which is load-bearing for the central performance claims.

    Authors: We agree that the reported figures are projections and that the zero-overhead assumption for the cascade interface requires explicit justification. In the revised manuscript we have added a new subsection (Section 4.3) that provides the per-stage cycle counts extracted from the AIE compiler reports, the measured cascade link bandwidth utilization, and a quantitative bound showing that synchronization overhead is below 0.8% of total execution time under the streaming dataflow model. We note that on-device wall-clock measurements on the VCK190 were not performed in this work; the projections remain the primary result, now supported by the added cycle-level analysis (a sketch of how such a bound is assembled appears after these responses). revision: yes

  2. Referee: The CPU baseline used for the 34× speedup comparison is described only as 'a single CPU core' with no details on processor model, clock frequency, optimization flags, or the specific matrix-element implementation, preventing assessment of whether the comparison is fair or reproducible.

    Authors: We acknowledge that the CPU baseline description was incomplete. The revised manuscript now specifies that the reference timing was obtained on a single core of an Intel Xeon Gold 6248R CPU at 3.0 GHz base frequency, using the MadGraph5_aMC@NLO LO matrix-element routine compiled with GCC 11.2, -O3, and -mavx2 flags. Timing was averaged over 10^5 phase-space points drawn from the same gg→ttg process; the exact command line and source file used for the baseline are provided in the new Appendix B. revision: yes
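
Referring to the first response above, a hypothetical illustration of how such an overhead bound could be assembled from per-stage cycle counts: in a streaming pipeline the steady-state rate is set by the slowest stage, so the fractional cost of a fixed per-token synchronization is roughly the sync cycles divided by the bottleneck stage's cycles. Every number below is invented for illustration; the actual values would come from the AIE compiler reports cited in the response.

    // Hypothetical overhead bound for a streaming five-stage pipeline.
    #include <algorithm>
    #include <array>
    #include <cstdio>

    int main() {
        const std::array<int, 5> stage_cycles = {5200, 6100, 5800, 4900, 5500};  // invented
        const int sync_cycles_per_token = 40;                                     // invented

        // In steady state one evaluation leaves the pipeline every
        // max(stage_cycles) cycles; synchronization adds sync_cycles_per_token.
        const int bottleneck = *std::max_element(stage_cycles.begin(), stage_cycles.end());
        const double overhead_fraction =
            static_cast<double>(sync_cycles_per_token) / (bottleneck + sync_cycles_per_token);

        std::printf("bottleneck stage: %d cycles\n", bottleneck);
        std::printf("sync overhead   : %.2f %%\n", 100.0 * overhead_fraction);
    }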

Circularity Check

0 steps flagged

No significant circularity in the hardware mapping and performance projection

full rationale

The paper presents an engineering implementation that decomposes the matrix-element computation into a five-stage pipeline to fit within per-tile memory limits and maps 80 such pipelines onto the 400 AIE tiles of the VCK190 device. The quoted throughput, speedup, and energy-efficiency figures are explicitly labeled as projections derived from this arithmetic mapping together with an assumption of negligible cascade overhead; numerical fidelity is separately validated against MadGraph at the ppm level. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, no load-bearing result rests on a self-citation chain, and no uniqueness theorem or ansatz is smuggled in. The derivation chain is therefore self-contained as a hardware-specific implementation rather than a self-referential mathematical derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard assumptions about floating-point arithmetic and hardware communication latency; the choice of five stages is a design decision driven by the 16 kB per-tile memory limit rather than a fitted parameter.

free parameters (1)
  • number_of_pipeline_stages
    Set to five to respect the 16 kB program memory constraint per AIE tile; chosen by hand during architecture design.
axioms (1)
  • domain assumption
    Matrix-element evaluation can be partitioned into sequential stages that communicate only via wavefunction tokens without introducing numerical instability or significant synchronization cost.
    Invoked to justify the cascade pipeline decomposition.

pith-pipeline@v0.9.0 · 5600 in / 1358 out tokens · 73606 ms · 2026-05-12T00:46:41.969151+00:00 · methodology

discussion (0)

