pith. machine review for the scientific record.

arxiv: 2605.12118 · v1 · submitted 2026-05-12 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links


Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions

Alexander Shen, Mikael Kuusela

Pith reviewed 2026-05-13 04:01 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords simulation-based inference · neural likelihood · score augmentation · stochastic processes · likelihood-free inference · binary cross-entropy

The pith

Augmenting loss functions with score gradients improves neural likelihood surrogates at lower cost than running more simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to train better neural network approximations to likelihoods in simulation-based inference by using exact score information. For models like network dynamics and spatial processes, the scores can be calculated exactly. Adding these scores to the standard training loss with adaptive weighting leads to higher-quality surrogates. In some cases, the method achieves inference performance comparable to using ten times as much training data while increasing training time by less than ten percent. The approach exploits model structure instead of treating the simulator as a complete black box.

Core claim

By augmenting the binary cross-entropy loss with the exact score ∇_θ log p(x | θ) and adaptive weighting based on loss gradients, neural likelihood surrogates achieve improved quality for stochastic process models at drastically lower computational cost than generating additional training data.
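
The page does not reproduce the paper's equations, but the figure captions name the augmented objective L_ASA. A schematic form consistent with the abstract, writing r̂_φ for the learned likelihood-to-evidence ratio and w for the adaptive weight (both symbols are assumptions of this sketch, not notation quoted from the paper):

```latex
\mathcal{L}_{\mathrm{ASA}}(\phi)
  = \mathcal{L}_{\mathrm{BCE}}(\phi)
  + w \,\mathbb{E}_{(x,\theta) \sim p(x,\theta)}
      \bigl\lVert \nabla_\theta \log \hat{r}_\phi(x,\theta)
                 - \nabla_\theta \log p(x \mid \theta) \bigr\rVert_2^2
```

The score term is well-posed for a ratio surrogate because ∇_θ log p(x) = 0, so ∇_θ log r̂_φ(x, θ) should converge to the exact score itself.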

What carries the argument

The score-augmented loss, which combines standard binary cross-entropy with terms using the exact gradient of the log-likelihood with respect to parameters θ.
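
A minimal PyTorch sketch of such a loss, assuming the surrogate is a classifier net(x, θ) whose logit approximates log p(x | θ) − log p(x); the function name, the fixed weight w_score, and the batching convention are assumptions of this sketch, not the paper's code:

```python
import torch
import torch.nn.functional as F

def score_augmented_loss(net, x_joint, theta_joint, score_joint,
                         x_marg, theta_marg, w_score=1.0):
    """Hedged sketch of a score-augmented NLER loss (not the paper's code).

    `net(x, theta)` is assumed to return the classifier logit, which at the
    optimum approximates log p(x | theta) - log p(x). Because p(x) does not
    depend on theta, the theta-gradient of the logit should match the exact
    score grad_theta log p(x | theta) supplied by the simulator.
    """
    # Standard BCE: joint pairs labeled 1, marginal (shuffled) pairs labeled 0.
    logits_joint = net(x_joint, theta_joint)
    logits_marg = net(x_marg, theta_marg)
    bce = F.binary_cross_entropy_with_logits(
        logits_joint, torch.ones_like(logits_joint)
    ) + F.binary_cross_entropy_with_logits(
        logits_marg, torch.zeros_like(logits_marg)
    )

    # Score term: differentiate the logit w.r.t. theta on joint pairs and
    # penalize the squared error against the exact score.
    theta = theta_joint.detach().requires_grad_(True)
    grad_theta = torch.autograd.grad(
        net(x_joint, theta).sum(), theta, create_graph=True
    )[0]
    score_loss = ((grad_theta - score_joint) ** 2).sum(dim=-1).mean()

    # The paper uses adaptive weighting derived from loss gradients;
    # the fixed w_score stands in for that here.
    return bce + w_score * score_loss
```

In practice the fixed w_score would be replaced by the paper's adaptive, loss-gradient-based weighting.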

Load-bearing premise

Exact score information, the gradient of the log-likelihood with respect to parameters, is available and computable for the models being studied.
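
As a concrete instance of this premise (a standard identity, not quoted from the paper): for the zero-mean Gaussian process case study with covariance K_θ, the score has the closed form

```latex
\nabla_{\theta_j} \log p(x \mid \theta)
  = -\tfrac{1}{2} \operatorname{tr}\!\left( K_\theta^{-1}
      \frac{\partial K_\theta}{\partial \theta_j} \right)
  + \tfrac{1}{2}\, x^{\top} K_\theta^{-1}
      \frac{\partial K_\theta}{\partial \theta_j} K_\theta^{-1} x
```

so each score evaluation costs roughly as much as one exact likelihood evaluation, which is cheap relative to generating many additional training simulations.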

What would settle it

A direct comparison showing that the augmented loss does not improve surrogate performance or downstream inference beyond what additional data provides would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2605.12118 by Alexander Shen, Mikael Kuusela.

Figure 1. Example data generated by each SPM in the case studies: (a) SIS continuous-time hidden Markov model; (b) Gaussian/Student-t process. Panels contrast unobserved state changes with data observed over time at random and fixed intervals.

Figure 2. CT-HMM for SIS epidemics, L-test set metrics for different NLER configurations: (left) L-test set L_BCE; (center-left) L-test set L_Score for infection rate λ; (center-right) L-test set L_Score for recovery rate µ. For each NLER network size (vertical lanes) and training dataset size (arrow colors), the indicated metric is plotted for the NLER trained under L_BCE (arrow tail) vs. trained under L_ASA (arrow head).

Figure 3. CT-HMM for SIS epidemics, hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test likelihood ratio test statistic (LRTS) metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and ground-truth (GT) LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 4. CT-HMM for SIS epidemics, maximum likelihood estimator (MLE) metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and ground-truth (GT) MLE for infection rate (left) and recovery rate (center-left). For each θ value in the E-test set, the mean squared error between the NLER MLE and the GT MLE is calculated, then the median is taken over all θ values…

Figure 5. L-test set metrics for the Gaussian spatial process setting. As with the previous case study, broad improvements in NLER performance appear without a significant increase in training cost. For E-test metrics and additional training-time metrics on this case study, see Figures 7, 8, and 12 in Appendix E.1.

Figure 6. L-test set metrics for the Student-t process. As with the previous case study, broad improvements in NLER performance appear without a significant increase in training cost. For E-test metrics and additional training-time metrics on this case study, see Figures 9, 10, and 13 in Appendix E.2.

Figure 7. Gaussian spatial process hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test LRTS metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and GT LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 8. Gaussian spatial process MLE metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and GT MLE for length scales (left) and nugget variance (center-left). Plots are formatted identically to…

Figure 9. Student-t spatial process hypothesis testing and confidence set metrics for different NLER configurations: (left) E-test LRTS metrics, i.e. MSE between NLER LRTS Λ_NLER(x, θ) and GT LRTS Λ_GT(x, θ); (center-left) average empirical coverage of 95% Wilks' confidence sets over E-test data, compared to nominal 95% coverage (horizontal dashed line); (center-right) …

Figure 10. Student-t spatial process MLE metrics for different NLER configurations: (left and center-left) median pointwise squared error between NLER MLE and GT MLE for length scales (left) and degrees of freedom (center-left). Plots are formatted identically to…

Figure 4. Both NLERs are size 30K, trained on 10K points. We see that…

Figure 11. NLER training-time metrics for the SIS epidemic case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. NLER network sizes and training dataset sizes are displayed identically to…

Figure 12. NLER training-time metrics for the Gaussian spatial process case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. Plots are formatted in the same manner as…

Figure 13. NLER training-time metrics for the Student-t spatial process case study: (left) time spent per batch during NLER training; (center) total time spent on NLER training; (right) total time elapsed between the start of NLER training and completion of the epoch with best validation L_BCE. Plots are formatted in the same manner as…
read the original abstract

For stochastic process models, parameter inference is often severely bottlenecked by computationally expensive likelihood functions. Simulation-based inference (SBI) bypasses this restriction by constructing amortized surrogate likelihoods, but most SBI methods assume a black-box data generating process. While these surrogates are exact in the limit of infinite training data, practical scenarios force a strict tradeoff between model quality and simulation cost. In this work, we loosen the black-box assumption of SBI to improve this tradeoff for structured stochastic process models. Specifically, for neural network likelihood surrogates trained via probabilistic classification, we propose to augment the standard binary cross-entropy loss with exact score information $\nabla_\theta \log p(x \mid \theta)$ and adaptive weighting based on loss gradients. We evaluate our approach on case studies involving network dynamics and spatial processes, demonstrating that our method improves surrogate quality at a drastically lower computational cost than generating more training data. Notably, in some cases, our approach achieves downstream inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes augmenting the binary cross-entropy loss used to train neural likelihood surrogates in simulation-based inference with exact score information ∇_θ log p(x | θ) together with adaptive weighting derived from loss gradients. The approach is targeted at structured stochastic process models for which the score is computable, and is evaluated on case studies involving network dynamics and spatial processes. The central empirical claim is that the resulting surrogates achieve downstream inference performance equivalent to a 10× increase in training data while incurring less than a 1.1× increase in training time.

Significance. If the reported efficiency gains are robust, the work provides a practical way to improve the simulation-data versus compute tradeoff in SBI for models that admit exact score evaluation. This loosens the strict black-box assumption in a controlled manner and could reduce the simulation budgets required for accurate amortized inference in domains such as network dynamics and spatial statistics.

minor comments (3)
  1. §3.2: the precise form of the adaptive weighting term (how loss-gradient magnitudes are normalized and combined with the score term) is not stated explicitly enough for immediate reproduction; an equation or pseudocode block would help (a hedged sketch of one candidate scheme appears after this list).
  2. §4.1 and Table 1: the time measurements should clarify whether score computation overhead is included in the reported 1.1× factor and whether the 10× data baseline uses the same network architecture and optimization schedule.
  3. The manuscript would benefit from a short discussion of the computational cost of obtaining the exact score for the two case-study models, even if this cost is assumed to be negligible relative to simulation.
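
Since the weighting rule is not reproduced on this page, the following is a hedged sketch of one gradient-norm-balancing scheme in the spirit of GradNorm (reference [4] below), which the paper cites; the authors' actual rule may differ:

```python
import torch

def gradient_norm_weight(bce_loss, score_loss, shared_params, eps=1e-12):
    """Hypothetical adaptive weight for the score term (not the paper's rule).

    Rescales the score term so that its gradient norm over the shared
    network parameters matches that of the BCE term, GradNorm-style.
    """
    g_bce = torch.autograd.grad(bce_loss, shared_params, retain_graph=True)
    g_score = torch.autograd.grad(score_loss, shared_params, retain_graph=True)
    norm_bce = torch.sqrt(sum((g ** 2).sum() for g in g_bce))
    norm_score = torch.sqrt(sum((g ** 2).sum() for g in g_score))
    # Detach so the weight acts as a constant in the combined objective.
    return (norm_bce / (norm_score + eps)).detach()

# Usage (assumed convention):
#   w = gradient_norm_weight(bce, score_loss, list(net.parameters()))
#   total_loss = bce + w * score_loss
```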

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending minor revision. We are encouraged that the efficiency gains from score-augmented losses for neural likelihood surrogates in structured stochastic process models are viewed as potentially impactful for loosening the black-box assumption in SBI in a controlled way.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a score-augmented binary cross-entropy loss for training neural likelihood surrogates, using exact score information as an explicit additional training signal for structured models where it is available. All reported performance gains (including the 10× data-equivalence claim) are empirical outcomes from controlled experiments on network dynamics and spatial processes, comparing the modified training procedure against standard baselines with matched simulation budgets. No step in the method or evaluation reduces by construction to a fitted parameter, a self-citation chain, or a renamed input; the black-box loosening is stated upfront, and the results are qualified as holding in some cases. The evaluation is therefore grounded in external benchmarks rather than in the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the availability of exact score information for structured models and on the assumption that neural networks trained via augmented classification can produce useful amortized likelihood surrogates.

axioms (2)
  • domain assumption: Exact score ∇_θ log p(x | θ) can be computed for the target structured stochastic processes.
    This is the explicit loosening of the black-box assumption stated in the abstract.
  • domain assumption: Neural network classifiers trained on simulated data can serve as amortized likelihood surrogates.
    A standard premise of the SBI methods referenced in the abstract.

pith-pipeline@v0.9.0 · 5489 in / 1339 out tokens · 75779 ms · 2026-05-13T04:01:06.580016+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  [1] Mogens Bladt and Michael Sørensen. Statistical inference for discretely observed Markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):395–410, 2005. doi:10.1111/j.1467-9868.2005.00508.x.

  [2] Jan Boelts, Philipp Harth, Richard Gao, Daniel Udvary, Felipe Yáñez, Daniel Baum, Hans-Christian Hege, Marcel Oberlaender, and Jakob H. Macke. Simulation-based inference for efficient identification of generative models in computational connectomics. PLOS Computational Biology, 19(9):1–28, 2023. doi:10.1371/journal.pcbi.1011406.

  [3] Johann Brehmer, Gilles Louppe, Juan Pavez, and Kyle Cranmer. Mining gold from implicit models to improve likelihood-free inference. Proceedings of the National Academy of Sciences, 117(10):5242–5249, 2020. doi:10.1073/pnas.1915980117.

  [4] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 794–803. PMLR, 2018.

  [5] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020. doi:10.1073/pnas.1912789117.

  [6] A. E. Gelfand, M. Fuentes, P. Guttorp, and P. Diggle. Handbook of Spatial Statistics. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. Taylor & Francis, 2010. ISBN 9781420072877.

  [7] Florian Gerber and Doug Nychka. Fast covariance parameter estimation of spatial Gaussian process models using neural networks. Stat, 10, 2021. doi:10.1002/sta4.382.

  [8] Aishik Ghosh, Maximilian Griese, Ulrich Haisch, and Tae Hyoun Park. Neural simulation-based inference of the Higgs trilinear self-coupling via off-shell Higgs production. Eur. Phys. J. C Part. Fields, 86(4), April 2026.

  [9] Joeri Hermans, Volodimir Begy, and Gilles Louppe. Likelihood-free MCMC with amortized approximate ratio estimators. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4239–4248. PMLR, 2020.

  [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.

  [11]/[12] Amanda Lenzi, Julie Bessac, Johann Rudi, and Michael L. Stein. Neural networks for parameter estimation in intractable models. Computational Statistics & Data Analysis, 185:107762, 2023. ISSN 0167-9473. doi:10.1016/j.csda.2023.107762.

  [13] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.

  [14] Jordan Richards, Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser. Neural Bayes estimators for censored inference with peaks-over-threshold models. Journal of Machine Learning Research, 25(390):1–49, 2024. URL http://jmlr.org/papers/v25/23-1134.html.

  [15] Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Jordan Richards, and Raphaël Huser. Neural Bayes estimators for irregular spatial data using graph neural networks. Journal of Computational and Graphical Statistics, 34(3):1153–1168, 2025.

  [16] Matthew Sainsbury-Dale, Andrew Zammit-Mangion, Noel Cressie, and Raphaël Huser. Neural parameter estimation with incomplete data, 2026. URL https://arxiv.org/abs/2501.04330.

  [17]/[18] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 877–885, Reykjavik, Iceland, 2014. PMLR. URL https://proceedings.mlr.press/v33/shah14.html.

  [19] Namid R. Stillman, Rory Baggott, Justin Lyon, Jianfei Zhang, Dingqui Zhu, Tao Chen, and Perukrishnen Vytelingum. Deep calibration of market simulations using neural density estimators and embedding networks. In Proceedings of the Fourth ACM International Conference on AI in Finance, ICAIF '23, pages 46–54, New York, NY, USA, 2023. Association for Computing Machinery.

  [20] Julia Walchessen, Amanda Lenzi, and Mikael Kuusela. Neural likelihood surfaces for spatial processes with computationally intensive or intractable likelihoods. Spatial Statistics, 62:100848, 2024. ISSN 2211-6753. doi:10.1016/j.spasta.2024.100848.

  [21] Kaijun Wang, Yunchao Gong, and Feng Hu. SIS epidemic propagation on scale-free hypernetwork. Applied Sciences, 12(21), 2022. ISSN 2076-3417. doi:10.3390/app122110934.

  [22] Andrew Zammit-Mangion, Matthew Sainsbury-Dale, and Raphaël Huser. Neural methods for amortized inference. Annual Review of Statistics and Its Application, 12:311–335, 2025. ISSN 2326-831X. doi:10.1146/annurev-statistics-112723-034123.

  [23] Justine Zeghal, François Lanusse, Alexandre Boucaud, Benjamin Remy, and Eric Aubourg. Neural posterior estimation with differentiable simulators, 2022. URL https://arxiv.org/abs/2207.05636.