pith. machine review for the scientific record.

arxiv: 2605.12118 · v1 · submitted 2026-05-12 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links


Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions

Alexander Shen, Mikael Kuusela

Pith reviewed 2026-05-13 04:01 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords simulation-based inference · neural likelihood · score augmentation · stochastic processes · likelihood-free inference · binary cross-entropy

The pith

Augmenting loss functions with score gradients improves neural likelihood surrogates at lower cost than running more simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to train better neural network approximations to likelihoods in simulation-based inference by using exact score information. For models like network dynamics and spatial processes, the scores can be calculated exactly. Adding these scores to the standard training loss with adaptive weighting leads to higher-quality surrogates. In some cases, the method achieves inference performance comparable to using ten times as much training data while increasing training time by less than ten percent. The approach exploits model structure instead of treating the simulator as a complete black box.

Core claim

By augmenting the binary cross-entropy loss with the exact score ∇_θ log p(x | θ) and adaptive weighting based on loss gradients, neural likelihood surrogates achieve improved quality for stochastic process models at drastically lower computational cost than generating additional training data.
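
The page does not reproduce the paper's equations, but the figure captions name the augmented objective L_ASA. A schematic form consistent with the abstract, writing r̂_φ for the learned likelihood-to-evidence ratio and w for the adaptive weight (both symbols are assumptions of this sketch, not notation quoted from the paper):

```latex
\mathcal{L}_{\mathrm{ASA}}(\phi)
  = \mathcal{L}_{\mathrm{BCE}}(\phi)
  + w \,\mathbb{E}_{(x,\theta) \sim p(x,\theta)}
      \bigl\lVert \nabla_\theta \log \hat{r}_\phi(x,\theta)
                 - \nabla_\theta \log p(x \mid \theta) \bigr\rVert_2^2
```

The score term is well-posed for a ratio surrogate because ∇_θ log p(x) = 0, so ∇_θ log r̂_φ(x, θ) should converge to the exact score itself.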

What carries the argument

The score-augmented loss, which combines standard binary cross-entropy with terms using the exact gradient of the log-likelihood with respect to parameters θ.
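
A minimal PyTorch sketch of such a loss, assuming the surrogate is a classifier net(x, θ) whose logit approximates log p(x | θ) − log p(x); the function name, the fixed weight w_score, and the batching convention are assumptions of this sketch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def score_augmented_loss(net, x_joint, theta_joint, score_joint,
                         x_marg, theta_marg, w_score=1.0):
    """Hedged sketch of a score-augmented NLER loss (not the paper's code).

    `net(x, theta)` is assumed to return the classifier logit, which at the
    optimum approximates log p(x | theta) - log p(x). Because p(x) does not
    depend on theta, the theta-gradient of the logit should match the exact
    score grad_theta log p(x | theta) supplied by the simulator.
    """
    # Standard BCE: joint pairs labeled 1, marginal (shuffled) pairs labeled 0.
    logits_joint = net(x_joint, theta_joint)
    logits_marg = net(x_marg, theta_marg)
    bce = F.binary_cross_entropy_with_logits(
        logits_joint, torch.ones_like(logits_joint)
    ) + F.binary_cross_entropy_with_logits(
        logits_marg, torch.zeros_like(logits_marg)
    )

    # Score term: differentiate the logit w.r.t. theta on joint pairs and
    # penalize the squared error against the exact score.
    theta = theta_joint.detach().requires_grad_(True)
    grad_theta = torch.autograd.grad(
        net(x_joint, theta).sum(), theta, create_graph=True
    )[0]
    score_loss = ((grad_theta - score_joint) ** 2).sum(dim=-1).mean()

    # The paper uses adaptive weighting derived from loss gradients;
    # the fixed w_score stands in for that here.
    return bce + w_score * score_loss
```

In practice the fixed w_score would be replaced by the paper's adaptive, loss-gradient-based weighting.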

Load-bearing premise

Exact score information, the gradient of the log-likelihood with respect to parameters, is available and computable for the models being studied.
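
As a concrete instance of this premise (a standard identity, not quoted from the paper): for the zero-mean Gaussian process case study with covariance K_θ, the score has the closed form

```latex
\nabla_{\theta_j} \log p(x \mid \theta)
  = -\tfrac{1}{2} \operatorname{tr}\!\left( K_\theta^{-1}
      \frac{\partial K_\theta}{\partial \theta_j} \right)
  + \tfrac{1}{2}\, x^{\top} K_\theta^{-1}
      \frac{\partial K_\theta}{\partial \theta_j} K_\theta^{-1} x
```

so each score evaluation costs roughly as much as one exact likelihood evaluation, which is cheap relative to generating many additional training simulations.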

What would settle it

A direct comparison showing that the augmented loss does not improve surrogate performance or downstream inference beyond what additional data provides would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2605.12118 by Alexander Shen, Mikael Kuusela.

Figure 1. Example data generated by each SPM in the case studies: (a) SIS continuous-time hidden Markov model; (b) Gaussian/Student-t process. Panels contrast unobserved state changes with data observed over time at random and fixed intervals.

Figure 2. CT-HMM for SIS epidemics, L-test set metrics for different NLER configurations: (left) L-test set L_BCE; (center-left) L-test set L_Score for infection rate λ; (center-right) L-test set L_Score for recovery rate µ. For each NLER network size (vertical lanes) and training dataset size (arrow colors), the indicated metric is plotted for the NLER trained under L_BCE (arrow tail) vs. trained under L_ASA (arrow head).

Figure 3. CT-HMM for SIS epidemics, hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test likelihood ratio test statistic (LRTS) metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and ground-truth (GT) LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 4. CT-HMM for SIS epidemics, maximum likelihood estimator (MLE) metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and ground-truth (GT) MLE for infection rate (left) and recovery rate (center-left). For each θ value in the E-test set, the mean squared error between the NLER MLE and the GT MLE is calculated, then the median is taken over all θ values…

Figure 5. L-test set metrics for the Gaussian spatial process setting. As with the previous case study, broad improvements in NLER performance appear without a significant increase in training cost. For E-test metrics and additional training-time metrics on this case study, see Figures 7, 8, and 12 in Appendix E.1.

Figure 6. L-test set metrics for the Student-t process. As with the previous case study, broad improvements in NLER performance appear without a significant increase in training cost. For E-test metrics and additional training-time metrics on this case study, see Figures 9, 10, and 13 in Appendix E.2.

Figure 7. Gaussian spatial process hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test LRTS metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and GT LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 8. Gaussian spatial process MLE metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and GT MLE for length scales (left) and nugget variance (center-left). Plots are formatted identically to…

Figure 9. Student-t spatial process hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test LRTS metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and GT LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 10. Student-t spatial process MLE metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and GT MLE for length scales (left) and degrees of freedom (center-left). Plots are formatted identically to…

Figure 4. Both NLERs are size 30K, trained on 10K points. We see that…

Figure 11. NLER training-time metrics for the SIS epidemic case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. NLER network sizes and training dataset sizes are displayed identically to…

Figure 12. NLER training-time metrics for the Gaussian spatial process case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. Plots are formatted in the same manner as…

Figure 13. NLER training-time metrics for the Student-t spatial process case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. Plots are formatted in the same manner as…
read the original abstract

For stochastic process models, parameter inference is often severely bottlenecked by computationally expensive likelihood functions. Simulation-based inference (SBI) bypasses this restriction by constructing amortized surrogate likelihoods, but most SBI methods assume a black-box data generating process. While these surrogates are exact in the limit of infinite training data, practical scenarios force a strict tradeoff between model quality and simulation cost. In this work, we loosen the black-box assumption of SBI to improve this tradeoff for structured stochastic process models. Specifically, for neural network likelihood surrogates trained via probabilistic classification, we propose to augment the standard binary cross-entropy loss with exact score information $\nabla_\theta \log p(x \mid \theta)$ and adaptive weighting based on loss gradients. We evaluate our approach on case studies involving network dynamics and spatial processes, demonstrating that our method improves surrogate quality at a drastically lower computational cost than generating more training data. Notably, in some cases, our approach achieves downstream inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes augmenting the binary cross-entropy loss used to train neural likelihood surrogates in simulation-based inference with exact score information ∇_θ log p(x | θ) together with adaptive weighting derived from loss gradients. The approach is targeted at structured stochastic process models for which the score is computable, and is evaluated on case studies involving network dynamics and spatial processes. The central empirical claim is that the resulting surrogates achieve downstream inference performance equivalent to a 10× increase in training data while incurring less than a 1.1× increase in training time.

Significance. If the reported efficiency gains are robust, the work provides a practical way to improve the simulation-data versus compute tradeoff in SBI for models that admit exact score evaluation. This loosens the strict black-box assumption in a controlled manner and could reduce the simulation budgets required for accurate amortized inference in domains such as network dynamics and spatial statistics.

minor comments (3)
  1. §3.2: the precise form of the adaptive weighting term (how loss-gradient magnitudes are normalized and combined with the score term) is not stated explicitly enough for immediate reproduction; an equation or pseudocode block would help (a hedged sketch of one candidate scheme appears after this list).
  2. §4.1 and Table 1: the time measurements should clarify whether score computation overhead is included in the reported 1.1× factor and whether the 10× data baseline uses the same network architecture and optimization schedule.
  3. The manuscript would benefit from a short discussion of the computational cost of obtaining the exact score for the two case-study models, even if this cost is assumed to be negligible relative to simulation.
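
Since the weighting rule is not reproduced on this page, the following is a hedged sketch of one gradient-norm-balancing scheme in the spirit of GradNorm (reference [4] below), which the paper cites; the authors' actual rule may differ:

```python
import torch

def gradient_norm_weight(bce_loss, score_loss, shared_params, eps=1e-12):
    """Hypothetical adaptive weight for the score term (not the paper's rule).

    Rescales the score term so that its gradient norm over the shared
    network parameters matches that of the BCE term, GradNorm-style.
    """
    g_bce = torch.autograd.grad(bce_loss, shared_params, retain_graph=True)
    g_score = torch.autograd.grad(score_loss, shared_params, retain_graph=True)
    norm_bce = torch.sqrt(sum((g ** 2).sum() for g in g_bce))
    norm_score = torch.sqrt(sum((g ** 2).sum() for g in g_score))
    # Detach so the weight acts as a constant in the combined objective.
    return (norm_bce / (norm_score + eps)).detach()

# Usage (assumed convention):
#   w = gradient_norm_weight(bce, score_loss, list(net.parameters()))
#   total_loss = bce + w * score_loss
```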

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending minor revision. We are encouraged that the efficiency gains from score-augmented losses for neural likelihood surrogates in structured stochastic process models are viewed as potentially impactful for loosening the black-box assumption in SBI in a controlled way.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a score-augmented binary cross-entropy loss for training neural likelihood surrogates, using exact score information as an explicit additional training signal for structured models where it is available. All reported performance gains (including the 10× data-equivalence claim) are empirical outcomes from controlled experiments on network dynamics and spatial processes, comparing the modified training procedure against standard baselines with matched simulation budgets. No step in the method or evaluation reduces by construction to a fitted parameter, a self-citation chain, or a renamed input; the black-box loosening is stated upfront, and the results are qualified as holding in some cases. The evaluation is therefore grounded in external benchmarks rather than in the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the availability of exact score information for structured models and on the assumption that neural networks trained via augmented classification can produce useful amortized likelihood surrogates.

axioms (2)
  • domain assumption: Exact score ∇_θ log p(x | θ) can be computed for the target structured stochastic processes.
    This is the explicit loosening of the black-box assumption stated in the abstract.
  • domain assumption: Neural network classifiers trained on simulated data can serve as amortized likelihood surrogates.
    A standard premise of the SBI methods referenced in the abstract.

pith-pipeline@v0.9.0 · 5489 in / 1339 out tokens · 75779 ms · 2026-05-13T04:01:06.580016+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  [1] Mogens Bladt and Michael Sørensen. Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):395–410, 2005. doi:10.1111/j.1467-9868.2005.00508.x.

  [2] Jan Boelts, Philipp Harth, Richard Gao, Daniel Udvary, Felipe Yáñez, Daniel Baum, Hans-Christian Hege, Marcel Oberlaender, and Jakob H. Macke. Simulation-based inference for efficient identification of generative models in computational connectomics. PLOS Computational Biology, 19(9):1–28, 2023. doi:10.1371/journal.pcbi.1011406.

  [3] Johann Brehmer, Gilles Louppe, Juan Pavez, and Kyle Cranmer. Mining gold from implicit models to improve likelihood-free inference. Proceedings of the National Academy of Sciences, 117(10):5242–5249, 2020. doi:10.1073/pnas.1915980117.

  [4] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 794–803. PMLR, 2018.

  [5] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020. doi:10.1073/pnas.1912789117.

  [6] A. E. Gelfand, M. Fuentes, P. Guttorp, and P. Diggle. Handbook of Spatial Statistics. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. Taylor & Francis, 2010. ISBN 9781420072877.

  [7] Florian Gerber and Doug Nychka. Fast covariance parameter estimation of spatial Gaussian process models using neural networks. Stat, 10, 2021. doi:10.1002/sta4.382.

  [8] Aishik Ghosh, Maximilian Griese, Ulrich Haisch, and Tae Hyoun Park. Neural simulation-based inference of the Higgs trilinear self-coupling via off-shell Higgs production. Eur. Phys. J. C Part. Fields, 86(4), April 2026.

  [9] Joeri Hermans, Volodimir Begy, and Gilles Louppe. Likelihood-free MCMC with amortized approximate ratio estimators. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4239–4248. PMLR, 2020.

  [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.

  [11]/[12] Amanda Lenzi, Julie Bessac, Johann Rudi, and Michael L. Stein. Neural networks for parameter estimation in intractable models. Computational Statistics & Data Analysis, 185:107762, 2023. ISSN 0167-9473. doi:10.1016/j.csda.2023.107762.

  [13] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.

  [14] Jordan Richards, Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser. Neural Bayes estimators for censored inference with peaks-over-threshold models. Journal of Machine Learning Research, 25(390):1–49, 2024. URL http://jmlr.org/papers/v25/23-1134.html.

  [15] Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Jordan Richards, and Raphaël Huser. Neural Bayes estimators for irregular spatial data using graph neural networks. Journal of Computational and Graphical Statistics, 34(3):1153–1168, 2025.

  [16] Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Noel Cressie, and Raphaël Huser. Neural parameter estimation with incomplete data, 2026. URL https://arxiv.org/abs/2501.04330.

  [17]/[18] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 877–885, Reykjavik, Iceland, 2014. PMLR. URL https://proceedings.mlr.press/v33/shah14.html.

  [19] Namid R. Stillman, Rory Baggott, Justin Lyon, Jianfei Zhang, Dingqui Zhu, Tao Chen, and Perukrishnen Vytelingum. Deep calibration of market simulations using neural density estimators and embedding networks. In Proceedings of the Fourth ACM International Conference on AI in Finance, ICAIF '23, pages 46–54, New York, NY, USA, 2023. Association for Computing Machinery.

  [20] Julia Walchessen, Amanda Lenzi, and Mikael Kuusela. Neural likelihood surfaces for spatial processes with computationally intensive or intractable likelihoods. Spatial Statistics, 62:100848, 2024. ISSN 2211-6753. doi:10.1016/j.spasta.2024.100848.

  [21] Kaijun Wang, Yunchao Gong, and Feng Hu. SIS epidemic propagation on scale-free hypernetwork. Applied Sciences, 12(21), 2022. ISSN 2076-3417. doi:10.3390/app122110934.

  [22] Andrew Zammit-Mangion, Matthew Sainsbury-Dale, and Raphaël Huser. Neural methods for amortized inference. Annual Review of Statistics and Its Application, 12:311–335, 2025. ISSN 2326-831X. doi:10.1146/annurev-statistics-112723-034123.

  [23] Justine Zeghal, François Lanusse, Alexandre Boucaud, Benjamin Remy, and Eric Aubourg. Neural posterior estimation with differentiable simulators, 2022. URL https://arxiv.org/abs/2207.05636.