Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

Alexander Grohsjean; Christian Schwanenberger; Finn Labe; Jörn Bach; Laurids Jeppe; Mads H. Baattrup; Peer Stelldinger

arXiv: 2605.22891 · v1 · pith:C3YG3ETQ · submitted 2026-05-21 · cs.LG · hep-ex

Pith reviewed 2026-05-25 05:37 UTC · model grok-4.3

classification: cs.LG · hep-ex

keywords: multimodal inverse problems · pointwise metrics · evaluation protocol · posterior spectrum · CRPS · uncertainty calibration · particle physics reconstruction

The pith

Point estimators trained to minimize MSE or MAE produce marginal spectra narrower than the true marginal spectrum whenever the posterior has nonzero width, as it does in multimodal inverse problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluation of scientific reconstructions relies heavily on pointwise metrics such as RMSE and MAE. The paper shows this leads to systematic bias because, by the law of total variance, any such point estimator compresses the spectrum whenever the posterior has width. This compression hides the very features like tails and modes that matter for later measurements. To address it, the authors introduce a protocol with three checks: distributional accuracy per event using CRPS, overall marginal spectrum match, and proper uncertainty calibration. Experiments on synthetic data and a particle physics inverse problem demonstrate that conclusions about which model is best can flip depending on the evaluation method used.

Core claim

The central claim is that pointwise metrics are structurally misleading for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The paper states that this bias is independent of architecture, training, and dataset size. It proposes a three-part protocol: CRPS for per-event distributional accuracy, a spectrum-fidelity diagnostic for population-level marginals, and coverage-based calibration for uncertainty. On both benchmarks, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures that are indistinguishable under CRPS.
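The per-event leg of the protocol can be illustrated with a sample-based CRPS estimate. The sketch below uses the standard energy form of CRPS for an ensemble, E|X - y| - 0.5 E|X - X'|; the function name `crps_ensemble` and the toy numbers are assumptions for illustration, and the paper's own estimator and ensemble size M may differ.

```python
import numpy as np

def crps_ensemble(samples, y):
    """Sample-based CRPS for one event (lower is better).

    samples: 1-D array of draws from the predicted posterior.
    y:       the true value for this event.
    Uses the energy form E|X - y| - 0.5 * E|X - X'| with plain
    averages over all sample pairs (hypothetical helper, not the paper's code).
    """
    samples = np.asarray(samples, dtype=float)
    # Mean absolute distance between the ensemble members and the truth.
    term_obs = np.mean(np.abs(samples - y))
    # Half the mean absolute distance between all pairs of members.
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_obs - term_spread

# Toy bimodal case: the truth sits at +1, the posterior has modes at -1 and +1.
point_at_mean = crps_ensemble([0.0], 1.0)        # a point estimate at the posterior mean
two_modes = crps_ensemble([-1.0, 1.0], 1.0)      # an ensemble covering both modes
print(point_at_mean, two_modes)
```

For a single-member ensemble the score reduces to the absolute error, so here the point estimate at the posterior mean scores 1.0 while the two-mode ensemble scores 0.5. The pairwise term costs O(M^2) memory in this form, which is acceptable for small ensembles only.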

What carries the argument

The law-of-total-variance decomposition, which shows that point predictions equal to the conditional mean must have strictly smaller marginal variance than the true distribution whenever the posterior has nonzero width.
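Written out, the decomposition reads as follows. The notation is an assumption made here for illustration (z for the reconstructed quantity, x for the observation, and \hat{z}(x) = E[z | x] for the MSE-optimal point estimate) and may differ from the paper's own symbols.

```latex
\begin{align}
\operatorname{Var}(z)
  &= \mathbb{E}_x\!\left[\operatorname{Var}(z \mid x)\right]
   + \operatorname{Var}_x\!\left(\mathbb{E}[z \mid x]\right), \\
\operatorname{Var}_x\!\left(\hat{z}(x)\right)
  &= \operatorname{Var}(z) - \mathbb{E}_x\!\left[\operatorname{Var}(z \mid x)\right]
  \;\le\; \operatorname{Var}(z).
\end{align}
```

The inequality is strict whenever the posterior variance is nonzero on a set of observations with positive probability. The identity is exact for the population conditional mean only; it says nothing by itself about finite-sample estimators or about conditional medians.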

If this is right

Model rankings obtained from pointwise metrics reverse when distributional metrics are used instead.
Calibration checks can separate models that appear equivalent under CRPS alone.
The choice of evaluation protocol determines the final scientific conclusion about model performance.
Downstream analyses depending on spectral features will be biased by the use of point estimators.
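The calibration check in the list above can be sketched as an empirical coverage curve. This is a simplified stand-in, assuming central credible intervals built from posterior samples; the figure captions indicate that the paper uses conformal calibration, which this sketch does not implement. The function name and array shapes are hypothetical.

```python
import numpy as np

def empirical_coverage(samples, truths, levels):
    """Fraction of events whose truth falls inside the central interval.

    samples: array of shape (n_events, n_samples), posterior draws per event.
    truths:  array of shape (n_events,), true values.
    levels:  iterable of nominal coverage levels in (0, 1).
    Returns an array with one empirical coverage value per nominal level.
    """
    samples = np.asarray(samples, dtype=float)
    truths = np.asarray(truths, dtype=float)
    coverage = []
    for level in levels:
        # Central interval: cut equal probability mass from both tails.
        lower = np.quantile(samples, 0.5 - level / 2.0, axis=1)
        upper = np.quantile(samples, 0.5 + level / 2.0, axis=1)
        inside = (truths >= lower) & (truths <= upper)
        coverage.append(inside.mean())
    return np.array(coverage)
```

A calibrated model gives empirical coverage close to the nominal level at every level; values below the diagonal indicate overconfident (too narrow) posteriors. For strongly multimodal posteriors, central intervals are a coarse choice and highest-density regions may be more appropriate.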

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar compression effects likely appear in other reconstruction tasks with multimodal posteriors such as medical imaging or astronomical parameter estimation.
The protocol offers a concrete way to test whether full posterior sampling avoids the variance loss shown for point estimators.
Existing pipelines that rely only on point metrics may need re-examination to check how much spectral information was lost.

Load-bearing premise

Downstream scientific measurements actually depend on the full shape of the posterior including tails and modes rather than just point estimates or low-order moments.

What would settle it

A direct comparison on a problem with a known analytic multimodal posterior in which the marginal variance of the point predictions equals the true marginal variance instead of being smaller.
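A comparison of this kind can be sketched on a small toy problem. The construction below is hypothetical and is not the paper's synthetic benchmark: x is standard normal, and z given x is a two-component mixture with modes near -1 and +1 whose weights depend on x, so the conditional mean and conditional variance are known in closed form.

```python
import numpy as np

def simulate_toy(n=200_000, sigma=0.1, seed=0):
    """Hypothetical bimodal toy: z | x is a mixture with modes at -1 and +1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    w = 1.0 / (1.0 + np.exp(-x))                  # P(mode = +1 | x)
    mode = np.where(rng.random(n) < w, 1.0, -1.0)
    z = mode + sigma * rng.normal(size=n)         # true values
    cond_mean = 2.0 * w - 1.0                     # E[z | x], the MSE-optimal prediction
    cond_var = 1.0 - cond_mean**2 + sigma**2      # Var(z | x)
    return z, cond_mean, cond_var

z, cond_mean, cond_var = simulate_toy()
var_truth = z.var()                               # width of the true marginal
var_point = cond_mean.var()                       # width of the point-prediction marginal
var_total = cond_var.mean() + var_point           # right-hand side of the variance identity

print(f"Var(z)                        = {var_truth:.3f}")
print(f"Var(E[z|x])                   = {var_point:.3f}")
print(f"E[Var(z|x)] + Var(E[z|x])     = {var_total:.3f}")
```

In this toy the point-prediction marginal is much narrower than the truth, and the two sides of the variance identity agree up to sampling noise. A result that would count against the claim is a problem of this type where `var_point` matches `var_truth` for the exact conditional mean.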

Figures

Figures reproduced from arXiv: 2605.22891 by Alexander Grohsjean, Christian Schwanenberger, Finn Labe, Jörn Bach, Laurids Jeppe, Mads H. Baattrup, Peer Stelldinger.

**Figure 1.** Synthetic benchmark results. (a) Per-event posteriors for an observation (x* = 1) in the bimodal regime. (b) Marginal distribution of reconstructed z over 10,000 test events. (c) Global conformal calibration coverage curves for the flow and MDN models.

**Figure 2.** Top reconstruction benchmark results. (a) Reconstructed per-event posterior over $\Delta\phi(\ell_{\mathrm{hel}}, \bar{t}_{t\bar{t}})$. The flow-based posteriors are nearly identical. (b) Marginal distribution of $\Delta\phi(\ell_{\mathrm{hel}}, \bar{t}_{t\bar{t}})$ over the test set. The flows sample a random point per event; the point estimators provide a point estimate without uncertainties. (c) Conformal coverage curve for the flows.

**Figure 3.** Additional conditional posteriors for the toy model presented in section 5. We present the …

**Figure 4.** (a) Marginal z over 10,000 test events. Similar to fig. 1(b), but we have included the marginal recovered by the heteroscedastic regression. (b) The sensitivity of the CRPS score is measured by calculating it as a function of ensemble size, M, for the distributional models. The flow's curve coincides with MDN's.

**Figure 5.** Calibration diagnostics beyond marginal coverage.

**Figure 6.** The sensitivity of the CRPS score over $\Delta\phi(\ell_{\mathrm{hel}}, \bar{t}_{t\bar{t}})$ is measured by calculating it as a function of ensemble size, M, for the flow-based models.

**Figure 7.** Marginal reconstructed spectra for $p_T^{t\bar{t}}$, $m_{t\bar{t}}$, $c_{\mathrm{hel}}$, and $\Delta\phi(\ell_{\mathrm{hel}}, \bar{t}_{t\bar{t}})$ (in order of increasing multimodality). The dashed line shows the truth marginal and the colored curves show each method's reconstructed marginal. Quantitative comparison via $\chi^2_{\mathrm{spec}}$ is in table 8.

**Figure 8.** Reconstructed posteriors (filled curves) and point estimates (vertical lines) for three …

**Figure 9.** Empirical coverage at the 90% nominal level as a function of the true value of each observable. Shaded bands show binomial uncertainty per bin. The rightmost bins for $p_T^{t\bar{t}}$ and $m_{t\bar{t}}$ have low statistics as indicated by the uncertainty bands.

Original abstract

Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Pointwise metrics can narrow marginal spectra in multimodal inverse problems, but the independence-from-training claim does not hold exactly and the protocol lacks supporting details.

The one or two things to know: point estimators trained on RMSE or MAE produce marginal distributions narrower than the true marginal whenever the posterior has width, and the authors offer a three-part protocol to check per-event accuracy, marginal fidelity, and calibration. The protocol is the main new element. It combines CRPS for individual events, a spectrum-fidelity diagnostic at the population level, and coverage checks, each targeting a failure mode that single point metrics miss. The paper does well to apply the law of total variance to explain the narrowing effect on tails and modes, and to show on synthetic and particle-physics cases that model rankings can reverse under the new metrics versus standard ones.

The soft spots are around the strong claim that the bias is independent of architecture, training, and dataset size. That concern is on target: the variance identity is exact only for the population conditional mean under MSE. It holds only approximately for finite-sample estimators and has no direct analogue for conditional medians under MAE. In practice the realized narrowing will vary with how well the model approximates the conditional quantity, which depends on data size and optimization. The abstract also gives no explicit formula for the spectrum-fidelity diagnostic and no quantitative results, so the ranking-reversal evidence cannot be assessed from the abstract alone. The premise that downstream science needs the full spectrum rather than moments is stated but not evidenced.

This paper is for researchers doing probabilistic reconstruction in scientific domains such as high-energy physics. A reader interested in evaluation methods would find the protocol structure worth considering as a starting point. It deserves peer review to supply the missing formulas, verify the experiments, and adjust the independence claim to match finite-data reality.

Referee Report

3 major / 1 minor

Summary. The manuscript argues that pointwise metrics (RMSE, MAE) structurally mislead in multimodal inverse problems: by the law of total variance, point estimators produce marginal spectra narrower than the true marginal spectrum whenever the posterior width is nonzero. The bias is claimed to be independent of architecture, training, and dataset size, and to compress the tails, modes, and shapes needed for downstream science. The manuscript proposes a three-part protocol (CRPS for per-event distributional accuracy, a spectrum-fidelity diagnostic for population-level marginal accuracy, and coverage calibration for uncertainty) and demonstrates ranking reversals on a synthetic analytic-posterior benchmark and a particle-physics many-to-one inverse problem.

Significance. If the core claims hold, the work is significant for ML evaluation in scientific inverse problems (e.g., particle physics), where it shows that metric choice can reverse model rankings and alter scientific conclusions. The analytic posterior in the synthetic benchmark is a strength for exact verification. The independence claim, if rigorously established, would be a notable result.

major comments (3)

[Abstract] The claim that 'the resulting bias is independent of architecture, training, and dataset size' does not hold exactly. The law of total variance decomposition applies to the population conditional mean (MSE minimizer) but has no direct analogue for conditional medians (MAE); any finite-sample estimator only approximates the population quantity, so realized narrowing depends on N, capacity, and optimization.
[Abstract] No explicit formula, derivation, or definition is supplied for the 'spectrum-fidelity diagnostic' that forms the second leg of the proposed protocol; this quantity is load-bearing for the claim that the protocol targets failure modes missed by pointwise metrics.
[Abstract] The premise that downstream scientific measurements 'rely on' the full posterior spectrum (tails, modes, shapes) rather than low-order moments or point estimates is asserted without derivation or empirical support; this is central to the significance argument but remains an assumption.

minor comments (1)

[Abstract] The abstract is information-dense; consider separating the protocol description from the bias argument for readability.
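The first major comment, that the variance identity has no direct analogue for conditional medians, can be illustrated numerically. The construction below is a hypothetical toy, not taken from the paper: the posterior for each x is a two-point distribution on -1 and +1, so the conditional mean and the conditional median are both known exactly.

```python
import numpy as np

def median_vs_mean_toy(n=200_000, seed=0):
    """Compare marginal variances of mean and median predictions (toy example).

    z | x takes the value +1 with probability w(x) and -1 otherwise.
    Returns (Var of truth, Var of conditional-mean predictions,
             Var of conditional-median predictions).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    w = 1.0 / (1.0 + np.exp(-x))                   # P(z = +1 | x)
    z = np.where(rng.random(n) < w, 1.0, -1.0)     # true values
    cond_mean = 2.0 * w - 1.0                      # MSE-optimal prediction
    cond_median = np.where(w > 0.5, 1.0, -1.0)     # MAE-optimal prediction
    return z.var(), cond_mean.var(), cond_median.var()

var_z, var_mean, var_median = median_vs_mean_toy()
print(f"truth: {var_z:.3f}  mean predictions: {var_mean:.3f}  median predictions: {var_median:.3f}")
```

In this construction the conditional-mean predictions are compressed, while the conditional-median predictions jump between the two modes and reproduce the marginal variance of the truth. This is a deliberately degenerate case, so it only shows that a strict narrowing statement for MAE needs an argument other than the law of total variance; it does not show how MAE-trained models behave on the paper's benchmarks.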

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these precise comments on the abstract. They highlight areas where greater rigor and explicitness will strengthen the manuscript. We address each point below and have made targeted revisions.

Referee: [Abstract] The claim that 'the resulting bias is independent of architecture, training, and dataset size' does not hold exactly. The law of total variance decomposition applies to the population conditional mean (MSE minimizer) but has no direct analogue for conditional medians (MAE); any finite-sample estimator only approximates the population quantity, so realized narrowing depends on N, capacity, and optimization.

Authors: We agree that the law of total variance supplies an exact population-level statement for the conditional mean (MSE minimizer) and that finite-sample estimators only approach this limit. The original wording was intended to convey that the bias is independent of architecture and dataset size once the estimator converges to the population quantity, but the phrasing was imprecise. For MAE the decomposition is not identical, though the qualitative compression of marginal spectra still occurs under multimodality. We have revised the abstract to read 'in the population limit, independent of architecture...' and added a clarifying sentence in Section 2.1 distinguishing the MSE case from the MAE case while preserving the core structural claim.

revision: yes

Referee: [Abstract] No explicit formula, derivation, or definition is supplied for the 'spectrum-fidelity diagnostic' that forms the second leg of the proposed protocol; this quantity is load-bearing for the claim that the protocol targets failure modes missed by pointwise metrics.

Authors: The spectrum-fidelity diagnostic is the integrated absolute difference between the empirical CDF of the reconstructed marginal and the true marginal CDF, evaluated over a fine grid of the observable. We have inserted a concise parenthetical definition and the explicit formula into the abstract and expanded the formal definition, including the discretization used in the experiments, in the revised Section 3.2.

revision: yes

Referee: [Abstract] The premise that downstream scientific measurements 'rely on' the full posterior spectrum (tails, modes, shapes) rather than low-order moments or point estimates is asserted without derivation or empirical support; this is central to the significance argument but remains an assumption.

Authors: The premise reflects standard practice in particle-physics unfolding and resonance extraction, where tail probabilities and spectral shapes directly enter cross-section and parameter fits. We have added a short paragraph in the introduction citing representative HEP references on the necessity of full-spectrum fidelity and included a quantitative illustration from the particle-physics benchmark showing how the compressed marginal produces a statistically significant bias in a downstream observable. While a domain-general derivation is outside the paper's scope, the revision supplies both literature grounding and empirical support.

revision: partial
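The definition of the spectrum-fidelity diagnostic given in the second response (integrated absolute difference between the reconstructed and true marginal CDFs on a fine grid) can be sketched as follows. This is a minimal reading of that one sentence, not the authors' implementation: the function name, the grid size, and the use of an empirical CDF for the true marginal are all assumptions.

```python
import numpy as np

def spectrum_fidelity(reco, truth, n_grid=1000):
    """Area between the empirical CDFs of two 1-D samples (lower is better).

    reco:   reconstructed values, one per event (point estimates or posterior draws).
    truth:  true values, used here to approximate the true marginal CDF.
    n_grid: number of grid points spanning the joint range of both samples.
    """
    reco = np.sort(np.asarray(reco, dtype=float))
    truth = np.sort(np.asarray(truth, dtype=float))
    grid = np.linspace(min(reco[0], truth[0]), max(reco[-1], truth[-1]), n_grid)
    # Empirical CDF: fraction of sample values <= each grid point.
    cdf_reco = np.searchsorted(reco, grid, side="right") / reco.size
    cdf_truth = np.searchsorted(truth, grid, side="right") / truth.size
    diff = np.abs(cdf_reco - cdf_truth)
    # Trapezoidal rule over the grid.
    return float(np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(grid)))
```

In one dimension the area between two CDFs equals the 1-Wasserstein distance between the distributions, so the value carries the units of the observable. A compressed marginal gives a nonzero score even when its mean matches the truth.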

Circularity Check

0 steps flagged

No circularity: central claim rests on external law of total variance

full rationale

The paper derives the narrowing of the marginal spectrum for point estimators from the law of total variance, an independent mathematical identity that holds for the population conditional mean and does not reduce to any fitted parameter, self-citation, or definitional loop within the manuscript. No equations rename a known empirical pattern, smuggle an ansatz via prior work, or treat a fitted input as a prediction. The proposed evaluation protocol is introduced separately and does not depend on the variance claim for its justification. The derivation chain is therefore self-contained against external benchmarks, with any overstatement regarding finite-sample MAE behavior or dataset-size independence constituting a correctness issue rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one standard probability identity and the modeling premise that scientific value resides in the full posterior shape; no free parameters or new entities are introduced in the abstract.

axioms (1)

standard math: The law of total variance decomposes the marginal variance as Var(X) = E[Var(X|Y)] + Var(E[X|Y]).
Invoked to prove that any point estimator produces a narrower marginal spectrum.

pith-pipeline@v0.9.0 · 5741 in / 1306 out tokens · 22384 ms · 2026-05-25T05:37:52.267611+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

[1]

Bishop.Pattern Recognition and Machine Learning

Christopher M. Bishop.Pattern Recognition and Machine Learning. Springer, January 2006. URL https://www.microsoft.com/en-us/research/publication/pattern-recog nition-machine-learning/

work page 2006
[2]

Topological reconstruction of particle physics processes using graph neural networks.Phys

Lukas Ehrke, John Andrew Raine, Knut Zoch, Manuel Guth, and Tobias Golling. Topological reconstruction of particle physics processes using graph neural networks.Phys. Rev. D, 107 (11):116019, 2023. doi: 10.1103/PhysRevD.107.116019

work page doi:10.1103/physrevd.107.116019 2023
[3]

SPANet: Generalized permutationless set assignment for particle physics using symmetry preserving attention.SciPost Phys., 12(5):178, 2022

Alexander Shmakov, Michael James Fenton, Ta-Wei Ho, Shih-Chieh Hsu, Daniel Whiteson, and Pierre Baldi. SPANet: Generalized permutationless set assignment for particle physics using symmetry preserving attention.SciPost Phys., 12(5):178, 2022. doi: 10.21468/SciPostPh ys.12.5.178

work page doi:10.21468/scipostph 2022
[4]

Sidky and Xiaochuan Pan

Emil Y . Sidky and Xiaochuan Pan. Report on the aapm deep-learning sparse-view ct grand challenge.Medical Physics, 49(8):4935–4943, 2022. doi: https://doi.org/10.1002/mp.15489. URLhttps://aapm.onlinelibrary.wiley.com/doi/abs/10.1002/mp.15489

work page doi:10.1002/mp.15489 2022
[5]

Zhihao Wang, Jian Chen, and Steven C. H. Hoi. Deep learning for image super-resolution: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387,

work page
[6]

doi: 10.1109/TPAMI.2020.2982166

work page doi:10.1109/tpami.2020.2982166 2020
[7]

Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly. Generating high resolution climate change projections through single image super-resolution: an abridged version. InProceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, page 5389–5393. AAAI Press, 2018. ISBN 9780...

work page 2018
[8]

Geophysical inversion versus machine learning in inverse problems

Yuji Kim and Nori Nakata. Geophysical inversion versus machine learning in inverse problems. Leading Edge, 37(12):894–901, December 2018. doi: 10.1190/tle37120894.1

work page doi:10.1190/tle37120894.1 2018
[9]

Statistically- informed deep learning for gravitational wave parameter estimation.Machine Learning: Science and Technology, 3(1):015007, November 2021

Hongyu Shen, E A Huerta, Eamonn O’Shea, Prayush Kumar, and Zhizhen Zhao. Statistically- informed deep learning for gravitational wave parameter estimation.Machine Learning: Science and Technology, 3(1):015007, November 2021. doi: 10.1088/2632-2153/ac3843. URL https://doi.org/10.1088/2632-2153/ac3843

work page doi:10.1088/2632-2153/ac3843 2021
[10]

Green, Jonathan Gair, Jakob H

Maximilian Dax, Stephen R. Green, Jonathan Gair, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Real-time gravitational wave science with neural posterior estimation. Phys. Rev. Lett., 127:241103, December 2021. doi: 10.1103/PhysRevLett.127.241103. URL https://link.aps.org/doi/10.1103/PhysRevLett.127.241103

work page doi:10.1103/physrevlett.127.241103 2021
[11]

Fast and improved neutrino reconstruction in multineutrino final states with conditional normalizing flows.Phys

John Andrew Raine, Matthew Leigh, Knut Zoch, and Tobias Golling. Fast and improved neutrino reconstruction in multineutrino final states with conditional normalizing flows.Phys. Rev. D, 109:012005, January 2024. doi: 10.1103/PhysRevD.109.012005. URL https: //link.aps.org/doi/10.1103/PhysRevD.109.012005

work page doi:10.1103/physrevd.109.012005 2024
[12]

The frontier of simulation-based inference

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020. doi: 10.1073/ pnas.1912789117. URLhttps://www.pnas.org/doi/abs/10.1073/pnas.1912789117

work page doi:10.1073/pnas.1912789117 2020
[13]

Fastϵ -free inference of simulation models with bayesian conditional density estimation

George Papamakarios and Iain Murray. Fastϵ -free inference of simulation models with bayesian conditional density estimation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file /6aca...

work page 2016
[14]

Strictly proper scoring rules, prediction, and estimation

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007. doi: 10.1198/016214 506000001437. URLhttps://doi.org/10.1198/016214506000001437

work page doi:10.1198/016214 2007
[15]

Decomposition of the continuous ranked probability score for ensemble prediction systems.Weather and Forecasting, 15(5):559–570, 2000

Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems.Weather and Forecasting, 15(5):559–570, 2000. doi: 10.1175/1520-043 4(2000)015<0559:DOTCRP>2.0.CO;2. URL https://journals.ametsoc.org/view/jo urnals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml

work page doi:10.1175/1520-043 2000
[16]

Evaluating probabilistic forecasts with scoringrules.Journal of Statistical Software, 90(12):1–37, 2019

Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with scoringrules.Journal of Statistical Software, 90(12):1–37, 2019. doi: 10.18637/jss.v090.i12. URLhttps://www.jstatsoft.org/index.php/jss/article/view/v090i12

work page doi:10.18637/jss.v090.i12 2019
[17]

A trust crisis in simulation-based inference? your posterior approximations can be unfaithful, 2022

Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, V olodimir Begy, and Gilles Louppe. A trust crisis in simulation-based inference? your posterior approximations can be unfaithful, 2022. URLhttps://arxiv.org/abs/2110.06581

work page arXiv 2022
[18]

Validating bayesian inference algorithms with simulation-based calibration, 2020

Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating bayesian inference algorithms with simulation-based calibration, 2020. URL https://arxiv. org/abs/1804.06788

work page arXiv 2020
[19]

On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks.CoRR, abs/1706.04599, 2017. URLhttp://arxiv.org/abs/1706.04599

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Accurate Uncertainties for Deep Learning Using Calibrated Regression

V olodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression.CoRR, abs/1807.00263, 2018. URL http://arxiv.or g/abs/1807.00263

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Learning by transduction

Alex Gammerman, V olodya V ovk, and Vladimir Vapnik. Learning by transduction. InPro- ceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI’98, page 148–155, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 155860555X. 11

work page 1998
[22]

Probabilistic conformal prediction using conditional random samples

Zhendong Wang, Ruijiang Gao, Mingzhang Yin, Mingyuan Zhou, and David Blei. Probabilistic conformal prediction using conditional random samples. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors,Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 ofProceedings of Machine Learning Researc...

work page 2023
[23]

Araz and Michael Spannowsky

Jack Y . Araz and Michael Spannowsky. Another fit bites the dust: Conformal prediction as a calibration standard for machine learning in high-energy physics, 2025. URL https: //arxiv.org/abs/2512.17048

work page arXiv 2025
[24]

Benchmarking simulation-based inference

Jan-Matthis Lueckmann, Jan Boelts, David Greenberg, Pedro Goncalves, and Jakob Macke. Benchmarking simulation-based inference. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 ofProceedings of Machine Learning Research, pages 343–351. PMLR, May 2021. URL...

work page 2021
[25]

Sampling- based accuracy testing of posterior estimators for general inference

Pablo Lemos, Adam Coogan, Yashar Hezaveh, and Laurence Perreault-Levasseur. Sampling- based accuracy testing of posterior estimators for general inference. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofPro...

work page 2023
[26]

Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences

Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman, and Yisong Yue. Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. InThe Thirteenth International Conference on Learning Representations, 2025. U...

work page 2025
[27]

Jan A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Base- lines.Astron. Astrophys. Suppl. Ser., 15:417, June 1974

work page 1974
[28]

Algebraic approach to solve tt dilepton equations.Phys

Lars Sonnenschein. Algebraic approach to solve tt dilepton equations.Phys. Rev. D, 72:095020, November 2005. doi: 10.1103/PhysRevD.72.095020. URL https://link.aps.org/doi/1 0.1103/PhysRevD.72.095020

work page doi:10.1103/physrevd.72.095020 2005
[29]

Enhanced reconstruction of dileptonic top quark-antiquark events using supervised machine learning methods

The CMS Collaboration. Enhanced reconstruction of dileptonic top quark-antiquark events using supervised machine learning methods. Technical report, CERN, Geneva, 2025. URL https://cds.cern.ch/record/2944724

work page arXiv 2025
[30]

Analyzing inverse problems with invertible neural networks

Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. InInternational Conference on Learning Representations,

work page
[31]

URLhttps://openreview.net/forum?id=rJed6j0cKX

work page
[32]

OUP Oxford,

Geoffrey Grimmett and David Stirzaker.Probability and Random Processes. OUP Oxford,

work page
[33]

Stanberry, Eric P

Tilmann Gneiting, Larissa I. Stanberry, Eric P. Grimit, Leonhard Held, and Nicholas A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds.TEST, 17(2):211–235, August 2008. ISSN 1863-8260. doi: 10.1007/s11749-008-0114-x. URLhttps://doi.org/10.1007/s11749-008-0114-x

work page doi:10.1007/s11749-008-0114-x 2008
[34]

Springer New York, New York, NY ,

Karl Pearson.On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling, pages 11–28. Springer New York, New York, NY ,

work page
[35]

doi: 10.1007/978-1-4612-4380-9_2

ISBN 978-1-4612-4380-9. doi: 10.1007/978-1-4612-4380-9_2. URL https: //doi.org/10.1007/978-1-4612-4380-9_2

work page doi:10.1007/978-1-4612-4380-9_2
[36]

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applica- tions to image databases. InSixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 59–66, January 1998. doi: 10.1109/ICCV.1998.710701. 12

work page doi:10.1109/iccv.1998.710701 1998
[37]

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness.Journal of the Royal Statistical Society Series B, 69(2):243–268, 2007. URL https://EconPapers.repec.org/RePEc:bla:jorssb:v:69:y:2007:i:2:p:243-268

work page 2007
[38]

Neural spline flows

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/7ac71d4 33f282034e...

work page 2019
[39]

Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021. URL http://jmlr.org/papers/v22/19 -1028.html

work page 2021
[40]

How to unfold top decays.SciPost Phys

Luigi Favaro, Roman Kogler, Alexander Paasch, Sofia Palacios Schweitzer, Tilman Plehn, and Dennis Schwarz. How to unfold top decays.SciPost Phys. Core, 8:053, 2025. doi: 10.21468/SciPostPhysCore.8.3.053. URL https://scipost.org/10.21468/SciPostPh ysCore.8.3.053

work page doi:10.21468/scipostphyscore.8.3.053 2025
[41]

Observation of a pseudoscalar excess at the top quark pair production threshold.Reports on Progress in Physics, 88(8):087801, August 2025

The CMS Collaboration. Observation of a pseudoscalar excess at the top quark pair production threshold.Reports on Progress in Physics, 88(8):087801, August 2025. doi: 10.1088/1361-663 3/adf7d3. URLhttps://doi.org/10.1088/1361-6633/adf7d3

work page doi:10.1088/1361-663 2025
[42]

Jerome de Favereau, Christophe Delaere, Pavel Demin, Andrea Giammanco, Vincent Lemaître, Alexandre Mertens, Michele Selvaggi, and The DELPHES 3 collaboration. Delphes 3: a modular framework for fast simulation of a generic collider experiment. Journal of High Energy Physics, 2014(2):57, 2014. doi: 10.1007/JHEP02(2014)057
[43]

John Andrew Raine, Matthew Leigh, Knut Zoch, Lukas Ehrke, Debajyoti Sengupta, and Tobias Golling. Dileptonic ttbar neutrino regression dataset, July 2023. URL https://doi.org/10.5281/zenodo.8113516
[44]

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018
[45]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t
[46]

Nathan Huetsch, Javier Mariño Villadamigo, Alexander Shmakov, Sascha Diefenbacher, Vinicius Mikuni, Theo Heimel, Michael Fenton, Kevin Greif, Benjamin Nachman, Daniel Whiteson, Anja Butter, and Tilman Plehn. The landscape of unfolding with machine learning. SciPost Phys., 18:070, 2025. doi: 10.21468/SciPostPhys.18.2.070. URL https://scipost.org/10.21468/Sc...
[47]

Marco Bellagente, Anja Butter, Gregor Kasieczka, Tilman Plehn, Armand Rousselot, Ramon Winterhalder, Lynton Ardizzone, and Ullrich Köthe. Invertible networks or partons to detector and back again. SciPost Phys., 9:074, 2020. doi: 10.21468/SciPostPhys.9.5.074. URL https://scipost.org/10.21468/SciPostPhys.9.5.074
[48]

Antoine Petitjean, Anja Butter, Kevin Greif, Sofia Palacios Schweitzer, Tilman Plehn, Jonas Spinner, and Daniel Whiteson. Generative unfolding of jets and their substructure, 2025. URL https://arxiv.org/abs/2510.19906
[49]

Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17164–17175. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c74c4bf0d...
[50]

Christopher A. T. Ferro, David S. Richardson, and Andreas P. Weigel. On the effect of ensemble size on the discrete and continuous ranked probability scores. Meteorological Applications, 15(1):19–24, 2008. doi: https://doi.org/10.1002/met.45. URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/met.45
[51]

Steve Baker and Robert D. Cousins. Clarification of the use of chi-square and likelihood functions in fits to histograms. Nuclear Instruments and Methods in Physics Research, 221(2):437–442, 1984. ISSN 0167-5087. doi: https://doi.org/10.1016/0167-5087(84)90016-4. URL https://www.sciencedirect.com/science/article/pii/0167508784900164
[52]

Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101
[53]

Christopher M. Bishop. Mixture density networks. Working Paper 4288, Aston University, 1994
[54]

Vincent Stimper, David Liu, Andrew Campbell, Vincent Berenz, Lukas Ryll, Bernhard Schölkopf, and José Miguel Hernández-Lobato. normflows: A pytorch package for normalizing flows. Journal of Open Source Software, 8(86):5361, 2023. doi: 10.21105/joss.05361. URL https://doi.org/10.21105/joss.05361
[55]

Johann Brehmer, Víctor Bresó, Pim de Haan, Tilman Plehn, Huilin Qu, Jonas Spinner, and Jesse Thaler. A Lorentz-equivariant transformer for all of the LHC. SciPost Phys., 19:108, 2025. doi: 10.21468/SciPostPhys.19.4.108. URL https://scipost.org/10.21468/SciPostPhys.19.4.108
[56]

Jonas Spinner, Victor Bresó, Pim De Haan, Tilman Plehn, Jesse Thaler, and Johann Brehmer. Lorentz-equivariant geometric algebra transformers for high-energy physics. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://arxiv.org/abs/2405.14806
[57]

Johann Brehmer, Pim de Haan, Sönke Behrends, and Taco Cohen. Geometric algebra transformer. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2305.18415
[58]

Ricky T. Q. Chen. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.

A Evaluation Metrics

A.1 Empirical CRPS estimator

For a predictive distribution represented by $N$ posterior samples $\{\hat{z}^{(k)}\}_{k=1}^{N}$, the CRPS of eq. (1) is estimated as

$$\widehat{\mathrm{CRPS}} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{z}^{(k)} - z\right| - \frac{1}{2N^{2}}\sum_{k=1}^{N}\sum_{j=1}^{N}\left|\hat{z}^{(k)} - \hat{z}^{(j)}\right|, \tag{8}$$

computable in $O(N\log N)$ via sorting [15]. T...
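The sample-based CRPS estimator of eq. (8) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `crps_from_samples` and `crps_naive` are hypothetical, and the sorted form relies on the standard identity that the double sum of pairwise absolute differences equals twice the sum of (2i − N − 1)·x₍ᵢ₎ over the sorted samples.

```python
def crps_from_samples(samples, z):
    """Sample-based CRPS estimate of eq. (8), O(N log N) via sorting.

    `samples` are posterior draws for one event, `z` is the true value.
    (Hypothetical helper name; not taken from the paper's code.)
    """
    x = sorted(float(s) for s in samples)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one posterior sample")
    # First term: mean absolute distance between the samples and the truth.
    accuracy = sum(abs(xi - z) for xi in x) / n
    # Second term: (1 / (2 N^2)) * sum_{k,j} |x_k - x_j|. For sorted samples
    # the double sum equals 2 * sum_i (2i - N - 1) * x_(i), with i = 1..N.
    spread = sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1)) / n**2
    return accuracy - spread


def crps_naive(samples, z):
    """Direct O(N^2) evaluation of eq. (8), kept only as a cross-check."""
    x = [float(s) for s in samples]
    n = len(x)
    accuracy = sum(abs(xi - z) for xi in x) / n
    spread = sum(abs(a - b) for a in x for b in x) / (2 * n**2)
    return accuracy - spread
```

For a dataset, the per-event values would be averaged over events. The naive version is useful for checking the sorted version on small sample sets before relying on it for large N.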
[59]

and TARP [24] extend coverage diagnostics to the joint setting. They are the natural choice when the protocol’s univariate CRPS is replaced by the energy score (section 4.1) for joint-posterior applications.

Choosing among them. We recommend using conformal prediction when comparing across model families (its finite-sample guarantee is family-agnostic). Us...
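To illustrate why the conformal guarantee is family-agnostic, the following is a minimal sketch of split conformal calibration applied to a sample-based central interval. It is an assumption made for illustration, not the procedure used in the paper: the score (signed distance of the truth outside the sample-quantile interval, in the style of conformalized quantile regression) and all function names are hypothetical. The only model input is a set of posterior samples per event, so the same code applies to any model family.

```python
import math


def empirical_quantile(sorted_vals, q):
    """Lower empirical quantile of pre-sorted values (no interpolation)."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(q * len(sorted_vals)) - 1))
    return sorted_vals[idx]


def conformal_margin(cal_samples, cal_truth, alpha=0.1):
    """Margin that widens (or shrinks) the central (1 - alpha) sample interval.

    cal_samples: one list of posterior samples per calibration event.
    cal_truth:   the true value for each calibration event.
    """
    scores = []
    for samples, y in zip(cal_samples, cal_truth):
        s = sorted(samples)
        lo = empirical_quantile(s, alpha / 2)
        hi = empirical_quantile(s, 1 - alpha / 2)
        # Positive if the truth lies outside [lo, hi], negative if inside.
        scores.append(max(lo - y, y - hi))
    scores.sort()
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration events for this alpha
    return scores[k - 1]


def conformal_interval(samples, margin, alpha=0.1):
    """Calibrated interval for a new event from its posterior samples."""
    s = sorted(samples)
    return (empirical_quantile(s, alpha / 2) - margin,
            empirical_quantile(s, 1 - alpha / 2) + margin)
```

Under exchangeability of calibration and test events, intervals built this way cover the truth with probability at least 1 − alpha, regardless of how the samples were produced. The guarantee is marginal over events, not conditional on a particular observation.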
[60]

Underdetermined Kinematics: While the detector measures the transverse components of the sum of the neutrino momenta $\vec{E}_T^{\mathrm{miss}}$, the individual longitudinal momenta ($p_z$) and the specific distribution of transverse momentum between the two neutrinos are unknown. This results in a system with six unknown degrees of freedom (the three-momentum components for...
[61]

Combinatorial Ambiguity: In a standard event, the detector identifies two b-jets, but it is not a priori known which jet originated from the top quark and which from the anti-top quark. For $n$ additional light-flavor jets in the event, the number of possible permutations for the final-state assignment grows factorially, creating a complex assignment problem.
[62]

Detector Resolution and Noise: The measured momenta of jets and the $\vec{E}_T^{\mathrm{miss}}$ are subject to experimental uncertainties and resolution effects. Traditional analytical “kinematic fitting” methods often fail when the measured values fluctuate such that no physical solution exists for the mass constraints.

D.2 Dataset details

We use the public Delphes [39] Mon...
[63]

architecture combining a transformer condition encoder with a stack of RQS coupling layers for the discrete flow.

                            Regression     Continuous flow    Discrete flow
                            (MSE & MMD)    (flow matching)    (ν²-flow style)
Condition encoder
  Encoder blocks            8              4                  4
  Attention heads           8              8                  8
  Hidden dimension          128            128                128
  Dropout                   0.1            0.1                0.1
  Positional encoding dim   8              8                  8
Flow / decoder head
  De...

[1]

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, January 2006. URL https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

[2]

Lukas Ehrke, John Andrew Raine, Knut Zoch, Manuel Guth, and Tobias Golling. Topological reconstruction of particle physics processes using graph neural networks. Phys. Rev. D, 107(11):116019, 2023. doi: 10.1103/PhysRevD.107.116019

[3]

Alexander Shmakov, Michael James Fenton, Ta-Wei Ho, Shih-Chieh Hsu, Daniel Whiteson, and Pierre Baldi. SPANet: Generalized permutationless set assignment for particle physics using symmetry preserving attention. SciPost Phys., 12(5):178, 2022. doi: 10.21468/SciPostPhys.12.5.178

[4]

Emil Y. Sidky and Xiaochuan Pan. Report on the AAPM deep-learning sparse-view CT grand challenge. Medical Physics, 49(8):4935–4943, 2022. doi: https://doi.org/10.1002/mp.15489. URL https://aapm.onlinelibrary.wiley.com/doi/abs/10.1002/mp.15489

[5]

Zhihao Wang, Jian Chen, and Steven C. H. Hoi. Deep learning for image super-resolution: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387. doi: 10.1109/TPAMI.2020.2982166

[7]

Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly. Generating high resolution climate change projections through single image super-resolution: an abridged version. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, page 5389–5393. AAAI Press, 2018. ISBN 9780...

[8]

Yuji Kim and Nori Nakata. Geophysical inversion versus machine learning in inverse problems. Leading Edge, 37(12):894–901, December 2018. doi: 10.1190/tle37120894.1

[9]

Hongyu Shen, E A Huerta, Eamonn O’Shea, Prayush Kumar, and Zhizhen Zhao. Statistically-informed deep learning for gravitational wave parameter estimation. Machine Learning: Science and Technology, 3(1):015007, November 2021. doi: 10.1088/2632-2153/ac3843. URL https://doi.org/10.1088/2632-2153/ac3843

[10]

Maximilian Dax, Stephen R. Green, Jonathan Gair, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Real-time gravitational wave science with neural posterior estimation. Phys. Rev. Lett., 127:241103, December 2021. doi: 10.1103/PhysRevLett.127.241103. URL https://link.aps.org/doi/10.1103/PhysRevLett.127.241103

[11]

John Andrew Raine, Matthew Leigh, Knut Zoch, and Tobias Golling. Fast and improved neutrino reconstruction in multineutrino final states with conditional normalizing flows. Phys. Rev. D, 109:012005, January 2024. doi: 10.1103/PhysRevD.109.012005. URL https://link.aps.org/doi/10.1103/PhysRevD.109.012005

[12]

Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020. doi: 10.1073/pnas.1912789117. URL https://www.pnas.org/doi/abs/10.1073/pnas.1912789117

[13]

George Papamakarios and Iain Murray. Fast ϵ-free inference of simulation models with bayesian conditional density estimation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/6aca...

[14]

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007. doi: 10.1198/016214506000001437. URL https://doi.org/10.1198/016214506000001437

[15]

Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, 2000. doi: 10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2. URL https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml

[16]

Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with scoringrules. Journal of Statistical Software, 90(12):1–37, 2019. doi: 10.18637/jss.v090.i12. URL https://www.jstatsoft.org/index.php/jss/article/view/v090i12

[17]

Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, Volodimir Begy, and Gilles Louppe. A trust crisis in simulation-based inference? your posterior approximations can be unfaithful, 2022. URL https://arxiv.org/abs/2110.06581

[18]

Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating bayesian inference algorithms with simulation-based calibration, 2020. URL https://arxiv.org/abs/1804.06788

[19]

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. CoRR, abs/1706.04599, 2017. URL http://arxiv.org/abs/1706.04599

[20]

Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. CoRR, abs/1807.00263, 2018. URL http://arxiv.org/abs/1807.00263

[21]

Alex Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI’98, page 148–155, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 155860555X.

[22]

Zhendong Wang, Ruijiang Gao, Mingzhang Yin, Mingyuan Zhou, and David Blei. Probabilistic conformal prediction using conditional random samples. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Researc...

[23]

Jack Y. Araz and Michael Spannowsky. Another fit bites the dust: Conformal prediction as a calibration standard for machine learning in high-energy physics, 2025. URL https://arxiv.org/abs/2512.17048

[24]

Jan-Matthis Lueckmann, Jan Boelts, David Greenberg, Pedro Goncalves, and Jakob Macke. Benchmarking simulation-based inference. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 343–351. PMLR, May 2021. URL...

[25]

Pablo Lemos, Adam Coogan, Yashar Hezaveh, and Laurence Perreault-Levasseur. Sampling-based accuracy testing of posterior estimators for general inference. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Pro...

[26]

Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman, and Yisong Yue. Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. In The Thirteenth International Conference on Learning Representations, 2025. U...

[27]

Jan A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astron. Astrophys. Suppl. Ser., 15:417, June 1974

[28]

Lars Sonnenschein. Algebraic approach to solve tt dilepton equations. Phys. Rev. D, 72:095020, November 2005. doi: 10.1103/PhysRevD.72.095020. URL https://link.aps.org/doi/10.1103/PhysRevD.72.095020

[29]

The CMS Collaboration. Enhanced reconstruction of dileptonic top quark-antiquark events using supervised machine learning methods. Technical report, CERN, Geneva, 2025. URL https://cds.cern.ch/record/2944724

[30]

Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations. URL https://openreview.net/forum?id=rJed6j0cKX

[32]

Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. OUP Oxford.

[33]

Tilmann Gneiting, Larissa I. Stanberry, Eric P. Grimit, Leonhard Held, and Nicholas A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2):211–235, August 2008. ISSN 1863-8260. doi: 10.1007/s11749-008-0114-x. URL https://doi.org/10.1007/s11749-008-0114-x

[34]

Karl Pearson. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling, pages 11–28. Springer New York, New York, NY. ISBN 978-1-4612-4380-9. doi: 10.1007/978-1-4612-4380-9_2. URL https://doi.org/10.1007/978-1-4612-4380-9_2

[36]

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 59–66, January 1998. doi: 10.1109/ICCV.1998.710701.

[37]

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B, 69(2):243–268, 2007. URL https://EconPapers.repec.org/RePEc:bla:jorssb:v:69:y:2007:i:2:p:243-268
