
arxiv: 2605.22891 · v1 · pith:C3YG3ETQnew · submitted 2026-05-21 · 💻 cs.LG · hep-ex

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

Pith reviewed 2026-05-25 05:37 UTC · model grok-4.3

classification 💻 cs.LG hep-ex
keywords multimodal inverse problems · pointwise metrics · evaluation protocol · posterior spectrum · CRPS · uncertainty calibration · particle physics reconstruction

The pith

Point estimators trained to minimize MSE or MAE produce marginal spectra narrower than the truth whenever the posterior of a multimodal inverse problem has nonzero width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluation of scientific reconstructions relies heavily on pointwise metrics such as RMSE and MAE. The paper shows this leads to systematic bias because, by the law of total variance, a point estimator that minimizes such a metric compresses the marginal spectrum whenever the posterior has nonzero width. This compression hides the very features, such as tails and modes, that matter for downstream measurements. To address it, the authors introduce a protocol with three checks: per-event distributional accuracy using CRPS, agreement of the population-level marginal spectrum, and calibration of the predicted uncertainty. Experiments on synthetic data and a particle physics inverse problem demonstrate that conclusions about which model is best can flip depending on the evaluation method used.

Core claim

The central claim is that pointwise metrics are structurally misleading for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. According to the paper, the bias is independent of architecture, training, and dataset size. A three-part protocol is proposed: CRPS for per-event accuracy, a spectrum-fidelity diagnostic for population marginals, and coverage-based calibration for uncertainty. On both benchmarks, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures that are indistinguishable under CRPS.
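The third leg of the protocol, coverage-based calibration, can be illustrated with a minimal sketch. The paper's figures refer to conformal coverage curves; the version below is a simplification that only checks how often the truth falls inside central intervals built from posterior samples. The function names `central_interval` and `coverage_curve`, the Gaussian toy posteriors, and the choice of central (rather than highest-density) intervals are all assumptions made for illustration and are not taken from the paper.

```python
import random

def central_interval(samples, level):
    """Central interval at a nominal level, read off empirical order statistics."""
    s = sorted(samples)
    lo_idx = int((0.5 - level / 2.0) * (len(s) - 1))
    hi_idx = int(round((0.5 + level / 2.0) * (len(s) - 1)))
    return s[lo_idx], s[hi_idx]

def coverage_curve(posteriors, truths, levels):
    """Empirical coverage per nominal level: fraction of truths inside the interval."""
    curve = []
    for level in levels:
        hits = 0
        for samples, z in zip(posteriors, truths):
            lo, hi = central_interval(samples, level)
            hits += lo <= z <= hi
        curve.append(hits / len(truths))
    return curve

# Toy check (assumed setup): the true posterior is N(mu, 1) for every event.
rng = random.Random(0)
levels = [0.5, 0.68, 0.9]
truths, calibrated, overconfident = [], [], []
for _ in range(1000):
    mu = rng.uniform(-1.0, 1.0)
    truths.append(rng.gauss(mu, 1.0))
    calibrated.append([rng.gauss(mu, 1.0) for _ in range(200)])     # correct width
    overconfident.append([rng.gauss(mu, 0.5) for _ in range(200)])  # half the true width

cov_cal = coverage_curve(calibrated, truths, levels)
cov_over = coverage_curve(overconfident, truths, levels)
for level, c, o in zip(levels, cov_cal, cov_over):
    print(f"nominal {level:.2f}: calibrated {c:.3f}, overconfident {o:.3f}")
```

A calibrated model tracks the diagonal (empirical coverage close to nominal), while the overconfident model undercovers at every level. For strongly multimodal posteriors a highest-density region would likely be a better interval choice than the central interval used here.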

What carries the argument

The law of total variance decomposition showing that point predictions from a multimodal posterior must have strictly smaller marginal variance than the true distribution.
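A small simulation can make this decomposition concrete. The sketch below assumes a hypothetical many-to-one toy problem (not the paper's benchmark) in which only the magnitude of the latent quantity is observed, so the posterior has two modes at +x and -x. The weight `W_PLUS` and the uniform range of x are illustrative choices. The point estimator is the exact conditional mean, i.e. the population MSE minimizer.

```python
import random
import statistics

W_PLUS = 0.7  # assumed posterior weight of the positive mode (illustrative)

def simulate(n_events=20000, seed=0):
    """Draw (truth, conditional-mean prediction, conditional variance) per event."""
    rng = random.Random(seed)
    truth, point_pred, cond_var = [], [], []
    for _ in range(n_events):
        x = rng.uniform(0.5, 2.0)               # observation: only |z| is measured
        z = x if rng.random() < W_PLUS else -x  # truth: the sign is lost, p(z|x) is bimodal
        mean = (2.0 * W_PLUS - 1.0) * x         # E[z|x], the population MSE minimizer
        truth.append(z)
        point_pred.append(mean)
        cond_var.append(x * x - mean * mean)    # Var(z|x) = E[z^2|x] - E[z|x]^2
    return truth, point_pred, cond_var

truth, point_pred, cond_var = simulate()
var_truth = statistics.pvariance(truth)       # width of the true marginal spectrum
var_point = statistics.pvariance(point_pred)  # width of the point-estimate spectrum
mean_cond_var = statistics.fmean(cond_var)    # E[Var(z|x)], the variance that is lost

print(f"Var(z)        = {var_truth:.3f}")
print(f"Var(E[z|x])   = {var_point:.3f}")
print(f"E[Var(z|x)]   = {mean_cond_var:.3f}")
print(f"sum of parts  = {var_point + mean_cond_var:.3f}")
```

The spectrum of point predictions is much narrower than the truth, and the missing variance matches the average posterior variance up to sampling noise. A trained network would only approximate the conditional mean, so the realized narrowing in practice would also depend on finite-sample effects.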

If this is right

  • Model rankings obtained from pointwise metrics reverse when distributional metrics are used instead.
  • Calibration checks can separate models that appear equivalent under CRPS alone.
  • The choice of evaluation protocol determines the final scientific conclusion about model performance.
  • Downstream analyses depending on spectral features will be biased by the use of point estimators.
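The ranking reversal in the first bullet can be reproduced in a toy setting. The sketch below is an assumption-laden illustration, not the paper's experiment: the posterior is a symmetric two-point distribution at +x and -x, the "point model" returns the conditional mean (zero), and the "distributional model" returns draws from the true posterior. For the pointwise metric, one posterior draw per event is used as the distributional model's prediction, in the spirit of the Figure 2 caption. The helper `crps_ensemble` is a naive O(N²) estimator written for clarity.

```python
import math
import random

def crps_ensemble(samples, z):
    """Empirical CRPS: mean |s - z| minus half the mean pairwise |s - s'|."""
    n = len(samples)
    term1 = sum(abs(s - z) for s in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2.0 * n * n)
    return term1 - term2

rng = random.Random(1)
sq_err_point, sq_err_sample, crps_point, crps_dist = [], [], [], []
for _ in range(2000):
    x = rng.uniform(0.5, 2.0)
    z = x if rng.random() < 0.5 else -x  # truth drawn from the bimodal posterior
    point = 0.0                          # conditional mean, the MSE-optimal point estimate
    ensemble = [x if rng.random() < 0.5 else -x for _ in range(32)]  # posterior draws
    sq_err_point.append((point - z) ** 2)
    sq_err_sample.append((ensemble[0] - z) ** 2)  # one draw used as a point prediction
    crps_point.append(crps_ensemble([point], z))  # CRPS of a point mass is the absolute error
    crps_dist.append(crps_ensemble(ensemble, z))

rmse_point = math.sqrt(sum(sq_err_point) / len(sq_err_point))
rmse_sample = math.sqrt(sum(sq_err_sample) / len(sq_err_sample))
mean_crps_point = sum(crps_point) / len(crps_point)
mean_crps_dist = sum(crps_dist) / len(crps_dist)

print(f"RMSE: point model {rmse_point:.3f}, distributional model {rmse_sample:.3f}")
print(f"CRPS: point model {mean_crps_point:.3f}, distributional model {mean_crps_dist:.3f}")
```

Under RMSE the point model looks better, while under CRPS the distributional model wins, so the ranking depends on the metric even though the distributional model samples the exact posterior.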

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar compression effects likely appear in other reconstruction tasks with multimodal posteriors such as medical imaging or astronomical parameter estimation.
  • The protocol offers a concrete way to test whether full posterior sampling avoids the variance loss shown for point estimators.
  • Existing pipelines that rely only on point metrics may need re-examination to check how much spectral information was lost.

Load-bearing premise

Downstream scientific measurements actually depend on the full shape of the posterior including tails and modes rather than just point estimates or low-order moments.

What would settle it

A direct comparison on a problem with a known analytic multimodal posterior in which the marginal variance of the point predictions equals the variance of the true marginal instead of being smaller.

Figures

Figures reproduced from arXiv: 2605.22891 by Alexander Grohsjean, Christian Schwanenberger, Finn Labe, Jörn Bach, Laurids Jeppe, Mads H. Baattrup, Peer Stelldinger.

Figure 1: Synthetic benchmark results. (a) Per-event posteriors for an observation (x* = 1) in the bimodal regime. (b) Marginal distribution of reconstructed z over 10,000 test events. (c) Global conformal calibration coverage curves for the flow and MDN models.
Figure 2: Top reconstruction benchmark results. (a) Reconstructed per-event posterior over Δϕ(ℓ_hel, t̄_{tt̄}). The flow-based posteriors are nearly identical. (b) Marginal distribution of Δϕ(ℓ_hel, t̄_{tt̄}) over the test set. The flows sample a random point per event; the point estimators provide a point estimate without uncertainties. (c) Conformal coverage curve for the flows.
Figure 3: Additional conditional posteriors for the toy model presented in section 5. We present the …
Figure 4: (a) Marginal z over 10,000 test events. Similar to fig. 1(b), but including the marginal recovered by the heteroscedastic regression. (b) Sensitivity of the CRPS score, calculated as a function of ensemble size M for the distributional models. The flow's curve coincides with the MDN's.
Figure 5: Calibration diagnostics beyond marginal coverage.
Figure 6: Sensitivity of the CRPS score over Δϕ(ℓ_hel, t̄_{tt̄}), calculated as a function of ensemble size M for the flow-based models.
Figure 7: Marginal reconstructed spectra for p_T^{tt̄}, m_{tt̄}, c_hel, and Δϕ(ℓ_hel, t̄_{tt̄}) (in order of increasing multimodality). The dashed line shows the truth marginal and the colored curves show each method's reconstructed marginal. Quantitative comparison via χ²_spec is in table 8.
Figure 8: Reconstructed posteriors (filled curves) and point estimates (vertical lines) for three …
Figure 9: Empirical coverage at the 90% nominal level as a function of the true value of each observable. Shaded bands show binomial uncertainty per bin. The rightmost bins for p_T^{tt̄} and m_{tt̄} have low statistics, as indicated by the uncertainty bands.
Original abstract

Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript argues that pointwise metrics (RMSE, MAE) structurally mislead in multimodal inverse problems: by the law of total variance, point estimators produce marginal spectra narrower than the true marginal whenever posterior width is nonzero, with the bias independent of architecture/training/dataset size and compressing tails/modes/shapes needed for downstream science. It proposes a three-part protocol (CRPS for per-event distributional accuracy, spectrum-fidelity diagnostic for population marginal accuracy, coverage calibration for uncertainty) and demonstrates ranking reversals on a synthetic analytic-posterior benchmark and a particle-physics many-to-one inverse problem.

Significance. If the core claims hold, the work is significant for ML evaluation in scientific inverse problems (e.g., particle physics), where it shows that metric choice can reverse model rankings and alter scientific conclusions. The analytic posterior in the synthetic benchmark is a strength for exact verification. The independence claim, if rigorously established, would be a notable result.

major comments (3)
  1. [Abstract] The claim that 'the resulting bias is independent of architecture, training, and dataset size' does not hold exactly. The law of total variance decomposition applies to the population conditional mean (MSE minimizer) but has no direct analogue for conditional medians (MAE); any finite-sample estimator only approximates the population quantity, so realized narrowing depends on N, capacity, and optimization.
  2. [Abstract] No explicit formula, derivation, or definition is supplied for the 'spectrum-fidelity diagnostic' that forms the second leg of the proposed protocol; this quantity is load-bearing for the claim that the protocol targets failure modes missed by pointwise metrics.
  3. [Abstract] The premise that downstream scientific measurements 'rely on' the full posterior spectrum (tails, modes, shapes) rather than low-order moments or point estimates is asserted without derivation or empirical support; this is central to the significance argument but remains an assumption.
minor comments (1)
  1. [Abstract] The abstract is information-dense; consider separating the protocol description from the bias argument for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these precise comments on the abstract. They highlight areas where greater rigor and explicitness will strengthen the manuscript. We address each point below and have made targeted revisions.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'the resulting bias is independent of architecture, training, and dataset size' does not hold exactly. The law of total variance decomposition applies to the population conditional mean (MSE minimizer) but has no direct analogue for conditional medians (MAE); any finite-sample estimator only approximates the population quantity, so realized narrowing depends on N, capacity, and optimization.

    Authors: We agree that the law of total variance supplies an exact population-level statement for the conditional mean (MSE minimizer) and that finite-sample estimators only approach this limit. The original wording was intended to convey that the bias is independent of architecture and dataset size once the estimator converges to the population quantity, but the phrasing was imprecise. For MAE the decomposition does not carry over exactly, though the qualitative compression of marginal spectra still occurs under multimodality. We have revised the abstract to read 'in the population limit, independent of architecture...' and added a clarifying sentence in Section 2.1 distinguishing the MSE case from the MAE case while preserving the core structural claim. revision: yes

  2. Referee: [Abstract] No explicit formula, derivation, or definition is supplied for the 'spectrum-fidelity diagnostic' that forms the second leg of the proposed protocol; this quantity is load-bearing for the claim that the protocol targets failure modes missed by pointwise metrics.

    Authors: The spectrum-fidelity diagnostic is the integrated absolute difference between the empirical CDF of the reconstructed marginal and the true marginal CDF, evaluated over a fine grid of the observable. We have inserted a concise parenthetical definition and the explicit formula into the abstract and expanded the formal definition, including the discretization used in the experiments, in the revised Section 3.2. revision: yes
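The definition given in this simulated response (integrated absolute difference between the reconstructed and true marginal CDFs on a fine grid) can be sketched as follows. This is an illustration of that verbal definition only: the Figure 7 caption refers to a χ²_spec comparison, so the paper's actual diagnostic may differ. The function names, the midpoint-rule grid, and the toy data are assumptions. In one dimension this quantity coincides with the 1-Wasserstein distance between the two marginals.

```python
import bisect
import random

def ecdf(sorted_values, t):
    """Empirical CDF of pre-sorted values evaluated at t."""
    return bisect.bisect_right(sorted_values, t) / len(sorted_values)

def spectrum_distance(reco, truth, n_grid=400):
    """Midpoint-rule integral of |F_reco(t) - F_truth(t)| over the pooled range."""
    r, t = sorted(reco), sorted(truth)
    lo, hi = min(r[0], t[0]), max(r[-1], t[-1])
    step = (hi - lo) / n_grid
    grid = [lo + (i + 0.5) * step for i in range(n_grid)]
    return step * sum(abs(ecdf(r, g) - ecdf(t, g)) for g in grid)

# Toy data (assumed): symmetric bimodal posterior with modes at +x and -x.
rng = random.Random(2)
truth, sampled, point = [], [], []
for _ in range(5000):
    x = rng.uniform(0.5, 2.0)
    truth.append(x if rng.random() < 0.5 else -x)
    sampled.append(x if rng.random() < 0.5 else -x)  # one draw from the true posterior
    point.append(0.0)                                # conditional mean of the posterior

d_sampled = spectrum_distance(sampled, truth)
d_point = spectrum_distance(point, truth)
print(f"posterior sampler: {d_sampled:.3f}, point estimator: {d_point:.3f}")
```

The posterior sampler reproduces the marginal up to sampling noise, while the collapsed point-estimate spectrum is far from the truth.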

  3. Referee: [Abstract] The premise that downstream scientific measurements 'rely on' the full posterior spectrum (tails, modes, shapes) rather than low-order moments or point estimates is asserted without derivation or empirical support; this is central to the significance argument but remains an assumption.

    Authors: The premise reflects standard practice in particle-physics unfolding and resonance extraction, where tail probabilities and spectral shapes directly enter cross-section and parameter fits. We have added a short paragraph in the introduction citing representative HEP references on the necessity of full-spectrum fidelity and included a quantitative illustration from the particle-physics benchmark showing how the compressed marginal produces a statistically significant bias in a downstream observable. While a domain-general derivation is outside the paper's scope, the revision supplies both literature grounding and empirical support. revision: partial

Circularity Check

0 steps flagged

No circularity: central claim rests on external law of total variance

full rationale

The paper derives the narrowing of the marginal spectrum for point estimators from the law of total variance, an independent mathematical identity that holds for the population conditional mean and does not reduce to any fitted parameter, self-citation, or definitional loop within the manuscript. No equations rename a known empirical pattern, smuggle an ansatz via prior work, or treat a fitted input as a prediction. The proposed evaluation protocol is introduced separately and does not depend on the variance claim for its justification. The derivation chain is therefore self-contained against external benchmarks, with any overstatement regarding finite-sample MAE behavior or dataset-size independence constituting a correctness issue rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one standard probability identity and the modeling premise that scientific value resides in the full posterior shape; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • standard math: the law of total variance decomposes the marginal variance as Var(X) = E[Var(X|Y)] + Var(E[X|Y])
    Invoked to show that the conditional-mean point estimator E[X|Y] produces a marginal spectrum narrower than the truth.
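
For reference, the step from this identity to the narrowing claim can be written out. The derivation below is a sketch in the ledger's notation (X the reconstructed quantity, Y the observation) and covers only the population conditional mean, i.e. the MSE minimizer; it does not cover the MAE case or finite-sample estimators.

```latex
\begin{align}
\operatorname{Var}(X)
  &= \mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right]
   + \operatorname{Var}\!\left(\mathbb{E}[X \mid Y]\right), \\
\hat{X}(Y) = \mathbb{E}[X \mid Y]
  \;&\Longrightarrow\;
  \operatorname{Var}\!\big(\hat{X}(Y)\big)
  = \operatorname{Var}(X) - \mathbb{E}\!\left[\operatorname{Var}(X \mid Y)\right]
  \le \operatorname{Var}(X).
\end{align}
```

The inequality is strict whenever the posterior variance Var(X|Y) is positive on a set of observations with nonzero probability, since the subtracted term is then strictly positive.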



Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

  1. [1]

    Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, January 2006. URL https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

  2. [2]

    Lukas Ehrke, John Andrew Raine, Knut Zoch, Manuel Guth, and Tobias Golling. Topological reconstruction of particle physics processes using graph neural networks. Phys. Rev. D, 107(11):116019, 2023. doi: 10.1103/PhysRevD.107.116019

  3. [3]

    Alexander Shmakov, Michael James Fenton, Ta-Wei Ho, Shih-Chieh Hsu, Daniel Whiteson, and Pierre Baldi. SPANet: Generalized permutationless set assignment for particle physics using symmetry preserving attention. SciPost Phys., 12(5):178, 2022. doi: 10.21468/SciPostPhys.12.5.178

  4. [4]

    Emil Y. Sidky and Xiaochuan Pan. Report on the AAPM deep-learning sparse-view CT grand challenge. Medical Physics, 49(8):4935–4943, 2022. doi: 10.1002/mp.15489. URL https://aapm.onlinelibrary.wiley.com/doi/abs/10.1002/mp.15489

  5. [5]

    Zhihao Wang, Jian Chen, and Steven C. H. Hoi. Deep learning for image super-resolution: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387. doi: 10.1109/TPAMI.2020.2982166

  7. [7]

    Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly. Generating high resolution climate change projections through single image super-resolution: an abridged version. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pages 5389–5393. AAAI Press, 2018. ISBN 9780...

  8. [8]

    Yuji Kim and Nori Nakata. Geophysical inversion versus machine learning in inverse problems. Leading Edge, 37(12):894–901, December 2018. doi: 10.1190/tle37120894.1

  9. [9]

    Hongyu Shen, E A Huerta, Eamonn O’Shea, Prayush Kumar, and Zhizhen Zhao. Statistically-informed deep learning for gravitational wave parameter estimation. Machine Learning: Science and Technology, 3(1):015007, November 2021. doi: 10.1088/2632-2153/ac3843. URL https://doi.org/10.1088/2632-2153/ac3843

  10. [10]

    Maximilian Dax, Stephen R. Green, Jonathan Gair, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. Real-time gravitational wave science with neural posterior estimation. Phys. Rev. Lett., 127:241103, December 2021. doi: 10.1103/PhysRevLett.127.241103. URL https://link.aps.org/doi/10.1103/PhysRevLett.127.241103

  11. [11]

    John Andrew Raine, Matthew Leigh, Knut Zoch, and Tobias Golling. Fast and improved neutrino reconstruction in multineutrino final states with conditional normalizing flows. Phys. Rev. D, 109:012005, January 2024. doi: 10.1103/PhysRevD.109.012005. URL https://link.aps.org/doi/10.1103/PhysRevD.109.012005

  12. [12]

    Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020. doi: 10.1073/pnas.1912789117. URL https://www.pnas.org/doi/abs/10.1073/pnas.1912789117

  13. [13]

    George Papamakarios and Iain Murray. Fast ϵ-free inference of simulation models with Bayesian conditional density estimation. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/6aca...

  14. [14]

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007. doi: 10.1198/016214506000001437. URL https://doi.org/10.1198/016214506000001437

  15. [15]

    Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, 2000. doi: 10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2. URL https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml

  16. [16]

    Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with scoringrules. Journal of Statistical Software, 90(12):1–37, 2019. doi: 10.18637/jss.v090.i12. URL https://www.jstatsoft.org/index.php/jss/article/view/v090i12

  17. [17]

    Joeri Hermans, Arnaud Delaunoy, François Rozet, Antoine Wehenkel, Volodimir Begy, and Gilles Louppe. A trust crisis in simulation-based inference? Your posterior approximations can be unfaithful, 2022. URL https://arxiv.org/abs/2110.06581

  18. [18]

    Sean Talts, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. Validating Bayesian inference algorithms with simulation-based calibration, 2020. URL https://arxiv.org/abs/1804.06788

  19. [19]

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. CoRR, abs/1706.04599, 2017. URL http://arxiv.org/abs/1706.04599

  20. [20]

    Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. CoRR, abs/1807.00263, 2018. URL http://arxiv.org/abs/1807.00263

  21. [21]

    Alex Gammerman, Volodya Vovk, and Vladimir Vapnik. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI’98, pages 148–155, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 155860555X.

  22. [22]

    Zhendong Wang, Ruijiang Gao, Mingzhang Yin, Mingyuan Zhou, and David Blei. Probabilistic conformal prediction using conditional random samples. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Researc...

  23. [23]

    Jack Y. Araz and Michael Spannowsky. Another fit bites the dust: Conformal prediction as a calibration standard for machine learning in high-energy physics, 2025. URL https://arxiv.org/abs/2512.17048

  24. [24]

    Jan-Matthis Lueckmann, Jan Boelts, David Greenberg, Pedro Goncalves, and Jakob Macke. Benchmarking simulation-based inference. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 343–351. PMLR, May 2021. URL...

  25. [25]

    Pablo Lemos, Adam Coogan, Yashar Hezaveh, and Laurence Perreault-Levasseur. Sampling-based accuracy testing of posterior estimators for general inference. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Pro...

  26. [26]

    Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman, and Yisong Yue. Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. In The Thirteenth International Conference on Learning Representations, 2025. U...

  27. [27]

    Jan A. Högbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astron. Astrophys. Suppl. Ser., 15:417, June 1974

  28. [28]

    Lars Sonnenschein. Algebraic approach to solve tt̄ dilepton equations. Phys. Rev. D, 72:095020, November 2005. doi: 10.1103/PhysRevD.72.095020. URL https://link.aps.org/doi/10.1103/PhysRevD.72.095020

  29. [29]

    The CMS Collaboration. Enhanced reconstruction of dileptonic top quark-antiquark events using supervised machine learning methods. Technical report, CERN, Geneva, 2025. URL https://cds.cern.ch/record/2944724

  30. [30]

    Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations. URL https://openreview.net/forum?id=rJed6j0cKX

  32. [32]

    Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. OUP Oxford.

  33. [33]

    Tilmann Gneiting, Larissa I. Stanberry, Eric P. Grimit, Leonhard Held, and Nicholas A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2):211–235, August 2008. ISSN 1863-8260. doi: 10.1007/s11749-008-0114-x. URL https://doi.org/10.1007/s11749-008-0114-x

  34. [34]

    Karl Pearson. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can be Reasonably Supposed to have Arisen from Random Sampling, pages 11–28. Springer New York, New York, NY. ISBN 978-1-4612-4380-9. doi: 10.1007/978-1-4612-4380-9_2. URL https://doi.org/10.1007/978-1-4612-4380-9_2

  36. [36]

    Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 59–66, January 1998. doi: 10.1109/ICCV.1998.710701.

  37. [37]

    Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B, 69(2):243–268, 2007. URL https://EconPapers.repec.org/RePEc:bla:jorssb:v:69:y:2007:i:2:p:243-268

  38. [38]

    Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/7ac71d433f282034e...

  39. [39]

    George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. URL http://jmlr.org/papers/v22/19-1028.html

  40. [40]

    Luigi Favaro, Roman Kogler, Alexander Paasch, Sofia Palacios Schweitzer, Tilman Plehn, and Dennis Schwarz. How to unfold top decays. SciPost Phys. Core, 8:053, 2025. doi: 10.21468/SciPostPhysCore.8.3.053. URL https://scipost.org/10.21468/SciPostPhysCore.8.3.053

  41. [41]

    The CMS Collaboration. Observation of a pseudoscalar excess at the top quark pair production threshold. Reports on Progress in Physics, 88(8):087801, August 2025. doi: 10.1088/1361-6633/adf7d3. URL https://doi.org/10.1088/1361-6633/adf7d3

  42. [42]

    Jerome de Favereau, Christophe Delaere, Pavel Demin, Andrea Giammanco, Vincent Lemaître, Alexandre Mertens, Michele Selvaggi, and The DELPHES 3 collaboration. DELPHES 3: a modular framework for fast simulation of a generic collider experiment. Journal of High Energy Physics, 2014(2):57, 2014. doi: 10.1007/JHEP02(2014)057

  43. [43]

    John Andrew Raine, Matthew Leigh, Knut Zoch, Lukas Ehrke, Debajyoti Sengupta, and Tobias Golling. Dileptonic ttbar neutrino regression dataset, July 2023. URL https://doi.org/10.5281/zenodo.8113516

  44. [44]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018

  45. [45]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

  46. [46]

    Nathan Huetsch, Javier Mariño Villadamigo, Alexander Shmakov, Sascha Diefenbacher, Vinicius Mikuni, Theo Heimel, Michael Fenton, Kevin Greif, Benjamin Nachman, Daniel Whiteson, Anja Butter, and Tilman Plehn. The landscape of unfolding with machine learning. SciPost Phys., 18:070, 2025. doi: 10.21468/SciPostPhys.18.2.070. URL https://scipost.org/10.21468/Sc...

  47. [47]

    Marco Bellagente, Anja Butter, Gregor Kasieczka, Tilman Plehn, Armand Rousselot, Ramon Winterhalder, Lynton Ardizzone, and Ullrich Köthe. Invertible networks or partons to detector and back again. SciPost Phys., 9:074, 2020. doi: 10.21468/SciPostPhys.9.5.074. URL https://scipost.org/10.21468/SciPostPhys.9.5.074

  48. [48]

    Antoine Petitjean, Anja Butter, Kevin Greif, Sofia Palacios Schweitzer, Tilman Plehn, Jonas Spinner, and Daniel Whiteson. Generative unfolding of jets and their substructure, 2025. URL https://arxiv.org/abs/2510.19906

  49. [49]

    Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17164–17175. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c74c4bf0d...

  50. [50]

    Christopher A. T. Ferro, David S. Richardson, and Andreas P. Weigel. On the effect of ensemble size on the discrete and continuous ranked probability scores. Meteorological Applications, 15(1):19–24, 2008. doi: 10.1002/met.45. URL https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/met.45

  51. [51]

    Steve Baker and Robert D. Cousins. Clarification of the use of chi-square and likelihood functions in fits to histograms. Nuclear Instruments and Methods in Physics Research, 221(2):437–442, 1984. ISSN 0167-5087. doi: 10.1016/0167-5087(84)90016-4. URL https://www.sciencedirect.com/science/article/pii/0167508784900164

  52. [52]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101

  53. [53]

    Christopher M. Bishop. Mixture density networks. Working Paper 4288, Aston University, 1994

  54. [54]

    normflows: A PyTorch package for normalizing flows. Journal of Open Source Software, 8(86):5361, 2023

    Vincent Stimper, David Liu, Andrew Campbell, Vincent Berenz, Lukas Ryll, Bernhard Schölkopf, and José Miguel Hernández-Lobato. normflows: A PyTorch package for normalizing flows. Journal of Open Source Software, 8(86):5361, 2023. doi: 10.21105/joss.05361. URL https://doi.org/10.21105/joss.05361

  55. [55]

    A Lorentz-equivariant transformer for all of the LHC. SciPost Phys., 19:108, 2025

    Johann Brehmer, Víctor Bresó, Pim de Haan, Tilman Plehn, Huilin Qu, Jonas Spinner, and Jesse Thaler. A Lorentz-equivariant transformer for all of the LHC. SciPost Phys., 19:108, 2025. doi: 10.21468/SciPostPhys.19.4.108. URL https://scipost.org/10.21468/SciPostPhys.19.4.108

  56. [56]

    Lorentz-equivariant geometric algebra transformers for high-energy physics

    Jonas Spinner, Victor Bresó, Pim De Haan, Tilman Plehn, Jesse Thaler, and Johann Brehmer. Lorentz-equivariant geometric algebra transformers for high-energy physics. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://arxiv.org/abs/2405.14806

  57. [57]

    Geometric algebra transformer

    Johann Brehmer, Pim de Haan, Sönke Behrends, and Taco Cohen. Geometric algebra transformer. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2305.18415

  58. [58]

    Ricky T. Q. Chen. torchdiffeq, 2018. URL https://github.com/rtqichen/torchdiffeq.

    A Evaluation Metrics

    A.1 Empirical CRPS estimator

    For a predictive distribution represented by $N$ posterior samples $\{\hat{z}^{(k)}\}_{k=1}^{N}$, the CRPS of eq. (1) is estimated as

    $$\widehat{\mathrm{CRPS}} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{z}^{(k)} - z\right| - \frac{1}{2N^{2}}\sum_{k=1}^{N}\sum_{j=1}^{N}\left|\hat{z}^{(k)} - \hat{z}^{(j)}\right|, \tag{8}$$

    computable in $O(N \log N)$ via sorting [15]. T...
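The sample-based estimator of eq. (8) can be sketched in a few lines. The following is a minimal illustration in pure Python, not the authors' implementation; the function names `crps_from_samples` and `crps_naive` are hypothetical. It assumes the standard sorted-sample identity, $\sum_{k,j} |x_k - x_j| = 2\sum_{i=1}^{N} (2i - N - 1)\, x_{(i)}$ with $x_{(i)}$ the $i$-th smallest sample, to evaluate the double sum after a single sort.

```python
def crps_from_samples(samples, z):
    """Empirical CRPS of eq. (8) for one event, in O(N log N).

    samples: iterable of posterior draws (floats); z: the true value.
    Hypothetical helper, written for illustration only.
    """
    xs = sorted(samples)  # the only O(N log N) step
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one posterior sample")
    # First term: mean absolute distance between samples and the truth.
    accuracy = sum(abs(x - z) for x in xs) / n
    # Double sum over all ordered pairs, via the sorted-sample identity
    # sum_{k,j} |x_k - x_j| = 2 * sum_i (2i - n - 1) * x_(i), i = 1..n.
    pair_sum = 2.0 * sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return accuracy - pair_sum / (2.0 * n * n)


def crps_naive(samples, z):
    """Direct O(N^2) evaluation of eq. (8), used only as a cross-check."""
    xs = list(samples)
    n = len(xs)
    accuracy = sum(abs(x - z) for x in xs) / n
    spread = sum(abs(a - b) for a in xs for b in xs) / (2.0 * n * n)
    return accuracy - spread
```

For example, `crps_from_samples([1.0, 2.0, 4.0], 3.0)` gives 4/3 - 2/3 = 2/3, matching the direct double sum. The quadratic version is kept only as a reference for testing the sorted form.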


    and TARP [24] extend coverage diagnostics to the joint setting. They are the natural choice when the protocol’s univariate CRPS is replaced by the energy score (section 4.1) for joint-posterior applications.

    Choosing among them. We recommend using conformal prediction when comparing across model families (its finite-sample guarantee is family-agnostic). Us...


    Underdetermined Kinematics: While the detector measures the transverse components of the sum of the neutrino momenta $\vec{E}_T^{\mathrm{miss}}$, the individual longitudinal momenta ($p_z$) and the specific distribution of transverse momentum between the two neutrinos are unknown. This results in a system with six unknown degrees of freedom (the three-momentum components for...


    Combinatorial Ambiguity: In a standard event, the detector identifies two b-jets, but it is not known a priori which jet originated from the top quark and which from the anti-top quark. For n additional light-flavor jets in the event, the number of possible permutations for the final-state assignment grows factorially, creating a complex assignment problem


    Detector Resolution and Noise: The measured momenta of jets and the $\vec{E}_T^{\mathrm{miss}}$ are subject to experimental uncertainties and resolution effects. Traditional analytical “kinematic fitting” methods often fail when the measured values fluctuate such that no physical solution exists for the mass constraints.

    D.2 Dataset details

    We use the public Delphes [39] Mon...


    architecture combining a transformer condition encoder with a stack of RQS coupling layers for the discrete flow.

                                 Regression     Continuous flow    Discrete flow
                                 (MSE & MMD)    (flow matching)    (ν²-flow style)
    Condition encoder
      Encoder blocks             8              4                  4
      Attention heads            8              8                  8
      Hidden dimension           128            128                128
      Dropout                    0.1            0.1                0.1
      Positional encoding dim    8              8                  8
    Flow / decoder head
      De...