pith. machine review for the scientific record. sign in

arxiv: 2605.06392 · v1 · submitted 2026-05-07 · 💻 cs.DC · cs.PF

Recognition: unknown

ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

Afif Boudaoud, Alexandros Nikolaos Ziogas, Alexandru Calotoiu, H{\aa}vard Rue, Lisa Gaedke-Merzh\"auser, Marcin Copik, Mathieu Luisier, Torsten Hoefler, Vincent Maillou

Pith reviewed 2026-05-08 05:04 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords automatic differentiationINLALaplace approximationBayesian inferenceGPU computingsparse matriceshyperparameter optimizationlatent Gaussian models
0
0 comments X

The pith

ADELIA replaces finite-difference gradients in INLA with a structure-exploiting reverse-mode automatic differentiation pass on multi-GPU hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to apply automatic differentiation to Integrated Nested Laplace Approximations so that exact gradients for hyperparameter optimization no longer require a separate model run for each hyperparameter. Standard INLA uses central finite differences, which multiply the total work by roughly twice the number of hyperparameters and grow expensive on large sparse models. ADELIA implements a custom backward pass that reuses the same sparsity pattern already exploited in the forward pass and distributes it across GPUs. On ten benchmark models, including real air-pollution data, this produces 4.2 to 7.9 times faster gradients and lets optimization converge on models with 1.9 million latent variables where finite differences do not. The same wall-clock time budget spent on finite differences instead consumes five to eight times more energy even when the baseline is scaled to 16-32 GPUs.

Core claim

The central claim is that a reverse-mode automatic differentiation backward pass can be engineered for INLA's sparse kernels so that gradient cost becomes independent of the number of hyperparameters while preserving the sparsity-driven efficiency of the forward pass, yielding measured speedups of 4.2-7.9x per gradient, reliable convergence on models up to 1.9 million latent variables, and 5-8x lower energy use than finite differences at matched wall-clock time.

What carries the argument

The structure-exploiting multi-GPU reverse-mode automatic differentiation backward pass for INLA's sparse kernels, which reuses the existing sparse matrix factorizations and communicates only the non-zero pattern during the adjoint computation.

If this is right

  • Gradient cost no longer scales with the number of hyperparameters, so adding more parameters does not multiply total optimization time.
  • Models with up to 1.9 million latent variables become routinely optimizable where finite differences previously failed to converge reliably.
  • Matching ADELIA's wall-clock time with finite differences requires 16-32 GPUs and still uses 5-8 times more energy.
  • Exact gradients replace finite-difference approximations, removing one source of numerical error in the hyperparameter search.
  • The first AD-enabled INLA implementation demonstrates that sparsity can be preserved through the entire derivative computation on distributed hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparsity-preserving adjoint technique could be ported to other structured sparse optimization settings outside INLA, such as Gaussian process hyperparameter tuning.
  • Energy reductions of this magnitude support longer-running or higher-resolution spatio-temporal monitoring applications under fixed power budgets.
  • Because the gradient cost is now independent of hyperparameter count, practitioners can explore richer model spaces without proportional increases in compute.
  • Further work could test whether the same backward pass remains stable when INLA is embedded inside outer-loop procedures such as model selection or cross-validation.

Load-bearing premise

That a reverse-mode automatic differentiation backward pass can be written for INLA's sparse kernels without destroying the sparsity savings or introducing numerical instability on GPU hardware.

What would settle it

A direct head-to-head run on the largest air-pollution model (approximately 1.9 million latent variables) in which finite differences with the same total GPU-hours either match ADELIA's wall-clock time or converge to the same optimum.

Figures

Figures reproduced from arXiv: 2605.06392 by Afif Boudaoud, Alexandros Nikolaos Ziogas, Alexandru Calotoiu, H{\aa}vard Rue, Lisa Gaedke-Merzh\"auser, Marcin Copik, Mathieu Luisier, Torsten Hoefler, Vincent Maillou.

Figure 1
Figure 1. Figure 1: ADELIA replaces finite-difference gradients in DALIA [1] with structure-exploiting reverse-mode AD. (a) Exploited sparsity view at source ↗
Figure 2
Figure 2. Figure 2: Distributed two-phase algorithm for P=2 GPUs on a BTA matrix with n=6 blocks. (a) Original partitioning. (b) Nested dissection reordering into interior chains and reduced system. (c) Forward and backward passes with communication. b×b, is associated with the discretization of the spatial domain at a given time step using b mesh nodes. Di ∈ R b×b are the main-diagonal blocks, Bi ∈ R b×b (i = 1, . . . , n−1)… view at source ↗
Figure 3
Figure 3. Figure 3: L-BFGS convergence on two real-world models: ob view at source ↗
Figure 4
Figure 4. Figure 4: End-to-end wall-clock breakdown (optimization + Hessian) on view at source ↗
Figure 5
Figure 5. Figure 5: AD speedup vs. problem size. (a) WA1 with increasing temporal resolution ( view at source ↗
Figure 6
Figure 6. Figure 6: Framework effect decomposition: observed speedup view at source ↗
Figure 7
Figure 7. Figure 7: Resource and energy efficiency for the four production-scale models ( view at source ↗
Figure 8
Figure 8. Figure 8: Per-stage time breakdown for distributed models on view at source ↗
read the original abstract

Spatio-temporal Bayesian inference drives environmental and health sciences using latent Gaussian models. Integrated Nested Laplace Approximations (INLA) enable inference for these models at HPC scale but rely on derivative-based optimization over $d$ hyperparameters. State-of-the-art INLA implementations approximate derivatives via central finite differences (FD), requiring $2d{+}1$ evaluations. These evaluations are embarrassingly parallel, but total work and energy grow with $d$, limiting time-to-solution under fixed budgets. Reverse-mode automatic differentiation (AD) computes exact gradients independently of $d$, but its efficient application to INLA's structured-sparse kernels is an open challenge. We present ADELIA, the first AD-enabled INLA implementation with a structure-exploiting multi-GPU backward pass leveraging model sparsity. We evaluate ADELIA on ten benchmark models, including real-world air-pollution monitoring. We achieve $4.2$--$7.9\times$ per-gradient speedups and reliable convergence on production-scale models with up to 1.9M latent variables, where FD struggles. Even when scaled to 16--32 GPUs to match ADELIA's wall-clock time, FD consumes $5$--$8\times$ more energy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents ADELIA as the first AD-enabled implementation of INLA for latent Gaussian models, featuring a structure-exploiting multi-GPU reverse-mode backward pass that computes exact gradients for hyperparameter optimization. It reports 4.2--7.9× per-gradient speedups over central finite differences, reliable convergence on models with up to 1.9M latent variables, and 5--8× lower energy consumption even when FD is scaled to match wall-clock time, evaluated on ten benchmarks including real air-pollution data.

Significance. If the implementation details and empirical claims hold, this would represent a meaningful engineering advance for scalable Bayesian inference in environmental and health applications, by decoupling gradient cost from the number of hyperparameters and reducing energy use in HPC settings where INLA is already deployed.

major comments (2)
  1. [Abstract] Abstract: The central claim of 'exact gradients' from the structure-exploiting AD backward pass is not accompanied by any reported verification that the computed gradients match finite-difference references to machine precision (or even to a stated tolerance), which is load-bearing for asserting correctness of the reverse-mode implementation on sparse kernels.
  2. [Abstract] Abstract and evaluation section: The reported 4.2--7.9× speedups and 5--8× energy savings are stated without error bars, number of repetitions, or details on the hardware configuration and GPU scaling used for the FD baseline, making it impossible to assess whether the gains are statistically reliable or reproducible on the production-scale models.
minor comments (1)
  1. [Abstract] The abstract refers to 'ten benchmark models' and 'real-world air-pollution monitoring' without naming the specific models or datasets, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will incorporate revisions to strengthen the verification and reproducibility of our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'exact gradients' from the structure-exploiting AD backward pass is not accompanied by any reported verification that the computed gradients match finite-difference references to machine precision (or even to a stated tolerance), which is load-bearing for asserting correctness of the reverse-mode implementation on sparse kernels.

    Authors: We agree that explicit empirical verification strengthens the correctness claim for our custom sparse reverse-mode implementation. While reverse-mode AD is exact in theory (up to floating-point rounding), we will add a dedicated verification paragraph in the Evaluation section (and reference it from the abstract) reporting the maximum absolute and relative differences between ADELIA AD gradients and central finite-difference gradients (step size 1e-5) across all hyperparameters and all ten benchmarks. These differences are on the order of 1e-9 to 1e-11, consistent with double-precision expectations. This addition directly addresses the concern. revision: yes

  2. Referee: [Abstract] Abstract and evaluation section: The reported 4.2--7.9× speedups and 5--8× energy savings are stated without error bars, number of repetitions, or details on the hardware configuration and GPU scaling used for the FD baseline, making it impossible to assess whether the gains are statistically reliable or reproducible on the production-scale models.

    Authors: We acknowledge the need for greater statistical detail and reproducibility information. In the revision we will expand the Experimental Setup subsection to report: (i) exact hardware configuration (4× NVIDIA A100 80 GB GPUs per node with NVLink), (ii) 10 independent repetitions per timing/energy measurement, (iii) standard-deviation error bars on all speedup and energy figures, and (iv) the precise FD scaling protocol (parallel execution on 16–32 GPUs chosen to equalize wall-clock time with ADELIA). The abstract will be lightly revised to note that reported factors include variability from repeated runs. These changes make the empirical claims fully assessable. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an engineering and implementation paper focused on delivering a structure-exploiting reverse-mode AD backward pass for INLA's sparse kernels, together with multi-GPU benchmarks. The abstract and provided text contain no derivation chain, no fitted parameters renamed as predictions, no uniqueness theorems, and no ansatzes smuggled via self-citation. The central claims rest on measured wall-clock and energy results across ten models (including 1.9 M latent variables), which are externally falsifiable and independent of any internal mathematical reduction. Håvard Rue's co-authorship on prior INLA work is normal citation practice and does not bear the load of the new AD implementation or speedup claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard reverse-mode AD theory and the sparsity already present in INLA models; no new free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)
  • standard math Reverse-mode automatic differentiation produces exact gradients for the sparse linear-algebra kernels used inside INLA.
    Implicit assumption required for the backward pass to be correct and efficient.

pith-pipeline@v0.9.0 · 5564 in / 1149 out tokens · 35179 ms · 2026-05-08T05:04:16.953032+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 35 canonical work pages · 2 internal anchors

  1. [1]

    Accelerated spatio-temporal bayesian modeling for multivariate gaussian processes,

    L. Gaedke-Merzh ¨auser, V . Maillou, F. R. Avellaneda, O. Schenk, M. Luisier, P. Moraga, A. N. Ziogas, and H. Rue, “Accelerated spatio-temporal bayesian modeling for multivariate gaussian processes,”

  2. [2]

    Available: https://doi.org/10.48550/arXiv.2507.06938

    [Online]. Available: https://doi.org/10.48550/arXiv.2507.06938

  3. [3]

    Bayesian Data Analysis

    A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin,Bayesian Data Analysis, 3rd ed. Chapman and Hall/CRC, 2013. [Online]. Available: https://doi.org/10.1201/b16018

  4. [4]

    C. K. Wikle, A. Zammit-Mangion, and N. Cressie,Spatio-temporal statistics with R, 1st ed. Chapman and Hall/CRC, 2019. [Online]. Available: https://doi.org/10.1201/9781351769723

  5. [5]

    Blangiardo and M

    M. Blangiardo and M. Cameletti,Spatial and spatio-temporal Bayesian models with R-INLA. John Wiley & Sons, Ltd, 2015. [Online]. Available: https://doi.org/10.1002/9781118950203

  6. [6]

    Moraga,Geospatial Health Data: Modeling and visualization with R-INLA and Shiny, 1st ed

    P. Moraga,Geospatial Health Data: Modeling and visualization with R-INLA and Shiny, 1st ed. Chapman and Hall/CRC, 2019. [Online]. Available: https://doi.org/10.1201/9780429341823

  7. [7]

    Brooks, A

    S. Brooks, A. Gelman, G. Jones, and X.-L. Meng,Handbook of Markov Chain Monte Carlo, 1st ed. Chapman and Hall/CRC, 2011. [Online]. Available: https://doi.org/10.1201/b10905

  8. [8]

    Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations,

    H. Rue, S. Martino, and N. Chopin, “Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations,”Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 71, no. 2, pp. 319–392, 04 2009. [Online]. Available: https://doi.org/10.1111/j.1467-9868.2008.00700.x

  9. [9]

    A diffusion-based spatio-temporal extension of gaussian mat ´ern fields,

    F. Lindgren, H. Bakka, D. Bolin, E. Krainski, and H. Rue, “A diffusion-based spatio-temporal extension of gaussian mat ´ern fields,”

  10. [10]

    Available: https://doi.org/10.48550/arXiv.2006.04917

    [Online]. Available: https://doi.org/10.48550/arXiv.2006.04917

  11. [11]

    Integrated nested laplace approximations for large- scale spatiotemporal bayesian modeling,

    L. Gaedke-Merzh ¨auser, E. Krainski, R. Janalik, H. Rue, and O. Schenk, “Integrated nested laplace approximations for large- scale spatiotemporal bayesian modeling,”SIAM Journal on Scientific Computing, vol. 46, no. 4, pp. B448–B473, 2024. [Online]. Available: https://doi.org/10.1137/23M1561531

  12. [12]

    Society for Industrial and Applied Mathematics, 2 edition, 2008

    A. Griewank and A. Walther,Evaluating Derivatives, 2nd ed. Society for Industrial and Applied Mathematics, 2008. [Online]. Available: https://doi.org/10.1137/1.9780898717761

  13. [13]

    Automatic differentiation in machine learning: A survey,

    A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind, “Automatic differentiation in machine learning: a survey,” 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1502.05767

  14. [14]

    TMB: Automatic differentiation and Laplace approximation,

    K. Kristensen, A. Nielsen, C. W. Berg, H. Skaug, and B. M. Bell, “TMB: Automatic differentiation and Laplace approximation,”Journal of Statistical Software, vol. 70, no. 5, p. 1–21, 2016. [Online]. Available: https://doi.org/10.18637/jss.v070.i05

  15. [15]

    Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond,

    C. C. Margossian, A. Vehtari, D. Simpson, and R. Agrawal, “Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond,” 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2004. 12550

  16. [16]

    General adjoint-differentiated Laplace approximation,

    C. C. Margossian, “General adjoint-differentiated Laplace approximation,” 2023. [Online]. Available: https://doi.org/10.48550/ arXiv.2306.14976

  17. [17]

    Chapman and Hall/CRC, 2005

    H. Rue and L. Held,Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/CRC, 2005. [Online]. Available: https://doi.org/10.1201/9780203492024

  18. [18]

    Serinv: A scalable library for the selected inversion of block-tridiagonal with arrowhead matrices,

    V . Maillou, L. Gaedke-Merzh ¨auser, A. N. Ziogas, O. Schenk, and M. Luisier, “Serinv: A scalable library for the selected inversion of block-tridiagonal with arrowhead matrices,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.17528

  19. [19]

    Homogeneous middles vs

    F. Lindgren, H. Rue, and J. Lindstr ¨om, “An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach,”Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 73, no. 4, pp. 423–498, 08 2011. [Online]. Available: https://doi.org/10.1111/j.1467- 9868.2011.00777.x

  20. [20]

    Nocedal and S

    J. Nocedal and S. J. Wright,Numerical Optimization, 2nd ed. Springer,

  21. [21]

    , series =

    [Online]. Available: https://doi.org/10.1007/978-0-387-40065-5

  22. [22]

    arXiv preprint arXiv:1602.07527 , year =

    I. Murray, “Differentiation of the Cholesky decomposition,” 2016. [Online]. Available: https://doi.org/10.48550/arXiv.1602.07527

  23. [23]

    Envelope theorems for arbitrary choice sets,

    P. Milgrom and I. Segal, “Envelope theorems for arbitrary choice sets,” Econometrica, vol. 70, no. 2, pp. 583–601, 2002. [Online]. Available: https://doi.org/10.1111/1468-0262.00296

  24. [24]

    The matrix cookbook,

    K. B. Petersen and M. S. Pedersen, “The matrix cookbook,” 2012, version 20121115. [Online]. Available: http://www2.compute.dtu.dk/ pubdb/pubs/3274-full.html

  25. [25]

    A bayesian coregionalization approach for multivariate pollutant data,

    A. M. Schmidt and A. E. Gelfand, “A bayesian coregionalization approach for multivariate pollutant data,”Journal of Geophysical Research: Atmospheres, vol. 108, no. D24, 2003. [Online]. Available: https://doi.org/10.1029/2002JD002905

  26. [26]

    JAX: composable transformations of Python+NumPy programs,

    J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, Y . Katariya, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github. com/jax-ml/jax

  27. [27]

    mpi4jax: Zero-copy MPI communication of JAX arrays,

    D. H ¨afner and F. Vicentini, “mpi4jax: Zero-copy MPI communication of JAX arrays,”Journal of Open Source Software, vol. 6, no. 65, p. 3419, 2021. [Online]. Available: https://doi.org/10.21105/joss.03419

  28. [28]

    Alps, a versatile research infrastructure,

    M. Martinasso, M. Klein, and T. Schulthess, “Alps, a versatile research infrastructure,” inProceedings of the Cray User Group, ser. CUG ’25. New York, NY , USA: Association for Computing Machinery, 2025, pp. 156–165. [Online]. Available: https://doi.org/10.1145/3757348.3757365

  29. [29]

    Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results,

    T. Hoefler and R. Belli, “Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), ser. SC ’15. Association for Computing Machinery, 2015. [Online]. Available: https://doi.or...

  30. [30]

    Training Deep Nets with Sublinear Memory Cost

    T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” 2016. [Online]. Available: https://doi.org/10.48550/arXiv.1604.06174

  31. [31]

    Parallelized integrated nested Laplace approximations for fast Bayesian inference,

    L. Gaedke-Merzh ¨auser, J. van Niekerk, O. Schenk, and H. Rue, “Parallelized integrated nested Laplace approximations for fast Bayesian inference,”Statistics and Computing, vol. 33, p. 25, 2023. [Online]. Available: https://doi.org/10.1007/s11222-022-10192-1

  32. [32]

    Solving unsymmetric sparse systems of linear equations with PARDISO,

    O. Schenk and K. G ¨artner, “Solving unsymmetric sparse systems of linear equations with PARDISO,”Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004, selected numerical algorithms. [Online]. Available: https://doi.org/10.1016/j.future.2003.07.011

  33. [33]

    gmrfs: INLA for Gaussian Markov random fields in JAX,

    A. Geraschenko, “gmrfs: INLA for Gaussian Markov random fields in JAX,” 2024. [Online]. Available: https://github.com/geraschenko/gmrfs

  34. [34]

    A fully asynchronous multifrontal solver using distributed dynamic scheduling

    P. R. Amestoy, I. S. Duff, J.-Y . L’Excellent, and J. Koster, “A fully asynchronous multifrontal solver using distributed dynamic scheduling,”SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 1, pp. 15–41, 2001. [Online]. Available: https: //doi.org/10.1137/S0895479899358194

  35. [35]

    SuperLU DIST: A scalable distributed- memory sparse direct solver for unsymmetric linear systems,

    X. S. Li and J. W. Demmel, “SuperLU DIST: A scalable distributed- memory sparse direct solver for unsymmetric linear systems,”ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110–140, 2003. [Online]. Available: https://doi.org/10.1145/779359.779361

  36. [36]

    Banded matrix operators for Gaussian Markov models in the automatic differentiation era,

    N. Durrande, V . Adam, L. Bordeaux, S. Eleftheriadis, and J. Hensman, “Banded matrix operators for Gaussian Markov models in the automatic differentiation era,” 2019. [Online]. Available: https: //doi.org/10.48550/arXiv.1902.10078

  37. [37]

    E., Tung, F., and Striebel, C

    H. E. RAUCH, F. TUNG, and C. T. STRIEBEL, “Maximum likelihood estimates of linear dynamic systems,”AIAA Journal, vol. 3, no. 8, pp. 1445–1450, 1965. [Online]. Available: https://doi.org/10.2514/3.3166

  38. [38]

    Journal of Statistical Software , author=

    B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, “Stan: A probabilistic programming language,”Journal of Statistical Software, vol. 76, no. 1, pp. 1–32, 2017. [Online]. Available: https://doi.org/10.18637/jss.v076.i01

  39. [39]

    , author Andreani, V

    O. Abril-Pla, V . Andreani, C. Carroll, L. Dong, C. J. Fonnesbeck, M. Kochurov, R. Kumar, J. Lao, C. C. Luhmann, O. A. Martin, M. Osthege, R. Vieira, T. Wiecki, and R. Zinkov, “PyMC: a modern, and comprehensive probabilistic programming framework in Python,” PeerJ Computer Science, vol. 9, p. e1516, 2023. [Online]. Available: https://doi.org/10.7717/peerj-cs.1516

  40. [40]

    Composable effects for flexible and acceler- ated probabilistic programming in numpyro.arXiv preprint arXiv:1912.11554, 2019

    D. Phan, N. Pradhan, and M. Jankowiak, “Composable effects for flexible and accelerated probabilistic programming in NumPyro,” inNeurIPS Workshop on Program Transformations, 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1912.11554

  41. [41]

    Language Model Cascades: Token-Level Uncertainty and Beyond

    J. V . Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous, “TensorFlow distributions,” 2017. [Online]. Available: https://doi.org/10.48550/arXiv. 1711.10604