Recognition: 2 theorem links · Lean Theorem
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Pith reviewed 2026-05-12 03:58 UTC · model grok-4.3
The pith
Stochastic Attention replaces softmax weights with normalized multinomial samples to generate calibrated predictive ensembles in transformers at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing softmax attention weights with normalized multinomial samples controlled by a concentration parameter, then tuning that parameter through a post-hoc calibration objective, produces predictive ensembles whose uncertainty aligns closely with targets. On the benchmarks tested, this delivers the strongest native calibration and the sharpest intervals among the compared approaches, at adaptation costs nearly three orders of magnitude lower than the next-best baseline.
What carries the argument
Stochastic Attention: an inference-time randomization of the attention mechanism that substitutes normalized multinomial samples for softmax weights, governed by one concentration parameter whose value is set by matching ensemble outputs to calibration targets.
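To make the mechanism concrete, here is a minimal sketch (not the authors' code) of one stochastic attention pass and a small ensemble built by repeating it. The function name, the shapes, and the choice to treat the concentration parameter ν as an integer number of multinomial draws are illustrative assumptions.

```python
import numpy as np

def stochastic_attention(scores, values, nu, rng):
    """One stochastic attention pass: illustrative sketch, not the paper's implementation.

    scores: (T, T) raw attention logits; values: (T, d) value vectors;
    nu: concentration parameter (treated here as an integer number of multinomial draws).
    """
    # Ordinary softmax rows pi_t over the keys.
    pi = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)

    # Replace each row by a normalized multinomial sample:
    # counts ~ Multinomial(nu, pi_t), sampled weights = counts / nu (non-negative, sum to 1).
    sampled = np.stack([rng.multinomial(nu, row) / nu for row in pi])
    return sampled @ values  # stochastic attention output

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
values = rng.normal(size=(4, 8))

# Repeating the stochastic forward pass yields a predictive ensemble;
# its mean and spread give a point prediction and an uncertainty estimate.
ensemble = np.stack([stochastic_attention(scores, values, nu=64, rng=rng) for _ in range(32)])
mean, spread = ensemble.mean(axis=0), ensemble.std(axis=0)
```

Larger ν concentrates the sampled weights around the softmax row, so the ensemble spread shrinks; smaller ν widens it, which is what the calibration step exploits.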
Load-bearing premise
The uncertainty produced by these single-pass stochastic attention ensembles can be aligned to desired targets through univariate post-hoc calibration of the concentration parameter without harming the model's core predictive accuracy.
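Because ν is the only quantity being tuned, the post-hoc step reduces to a one-dimensional search. The sketch below is a hedged illustration, assuming a simple interval-coverage objective on a held-out calibration set; the paper's actual objective (matching the stochastic attention output with the target) is not reproduced here, and `predict_ensemble` and the search grid are hypothetical.

```python
import numpy as np

def calibration_error(nu, predict_ensemble, x_cal, y_cal, alpha=0.1):
    """Illustrative univariate calibration objective (an assumption, not the paper's exact loss).

    predict_ensemble(x, nu) -> array of shape (n_samples, n_points) of stochastic predictions.
    Returns how far the empirical coverage of the central (1 - alpha) interval
    deviates from its nominal level on the calibration set.
    """
    preds = predict_ensemble(x_cal, nu)
    lo = np.quantile(preds, alpha / 2, axis=0)
    hi = np.quantile(preds, 1 - alpha / 2, axis=0)
    coverage = np.mean((y_cal >= lo) & (y_cal <= hi))
    return abs(coverage - (1 - alpha))

def tune_nu(predict_ensemble, x_cal, y_cal, grid=(4, 8, 16, 32, 64, 128, 256)):
    """One-dimensional search over the concentration parameter."""
    errors = [calibration_error(nu, predict_ensemble, x_cal, y_cal) for nu in grid]
    return grid[int(np.argmin(errors))]
```

Any univariate optimizer would do; the point is that adaptation amounts to a handful of ensemble evaluations rather than retraining.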
What would settle it
On a held-out scientific forecasting benchmark, Sample Average Stochastic Attention yields higher calibration error or wider intervals than a strong uncertainty-aware baseline while the tuned concentration degrades mean squared error relative to the deterministic model.
Original abstract
Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a sample-average, lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on scientific foundation models for weather and time-series forecasting, as well as several regression tasks. Across benchmarks against uncertainty-aware baselines, we find that Sample Average Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable calibration, with adaptation costs nearly three orders of magnitude lower than the next-best baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Stochastic Attention, a lightweight inference-time modification for transformer-based scientific foundation models. It replaces softmax attention weights with normalized multinomial samples controlled by a single concentration parameter to generate predictive ensembles without retraining. A univariate post-hoc calibration tunes this parameter to match stochastic outputs to targets. On benchmarks for weather and time-series forecasting plus regression tasks, Sample Average Stochastic Attention is reported to achieve the strongest native calibration and sharpest prediction intervals at comparable calibration, with adaptation costs nearly three orders of magnitude lower than baselines.
Significance. If the claims hold after addressing the noted concerns, the approach would provide an efficient, low-cost method for adding calibrated uncertainty to existing large scientific foundation models without retraining, which could be impactful for high-stakes applications such as weather prediction.
major comments (2)
- [Abstract and methods description of Stochastic Attention] Abstract and Stochastic Attention mechanism: although sampled attention weights have the same expectation as the original softmax weights, the non-linear transformer blocks (FFN layers, activations, residual additions) that follow attention imply that the expected output of the full stochastic forward pass need not equal the deterministic output. This can bias the ensemble mean relative to the original model's point predictions and potentially degrade core performance metrics even before calibration. The manuscript provides no explicit comparison of deterministic vs. ensemble-mean predictions or correction for this effect on the reported benchmarks.
- [Experimental results section] Experimental evaluation: the abstract claims superior calibration and interval sharpness but the provided details lack full methods, benchmark specifications, statistical validation (e.g., error bars or significance tests), and verification that the univariate post-hoc tuning preserves predictive accuracy. These are load-bearing for the central claim of outperforming uncertainty-aware baselines without degradation.
minor comments (1)
- [Abstract] Clarify the precise definition and distinction of 'Sample Average Stochastic Attention' relative to the general Stochastic Attention proposal, as the term appears only in the results claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Abstract and methods description of Stochastic Attention] Abstract and Stochastic Attention mechanism: although sampled attention weights have the same expectation as the original softmax weights, the non-linear transformer blocks (FFN layers, activations, residual additions) that follow attention imply that the expected output of the full stochastic forward pass need not equal the deterministic output. This can bias the ensemble mean relative to the original model's point predictions and potentially degrade core performance metrics even before calibration. The manuscript provides no explicit comparison of deterministic vs. ensemble-mean predictions or correction for this effect on the reported benchmarks.
Authors: We agree that the non-linearities following the attention layer mean the ensemble mean need not equal the deterministic output, and the current manuscript does not provide an explicit comparison. In the revised version we will add this analysis to the experimental results section, reporting the relative difference between deterministic predictions and sample-average ensemble means on every benchmark. This will quantify any bias and its effect on core metrics prior to calibration. revision: yes
- Referee: [Experimental results section] Experimental evaluation: the abstract claims superior calibration and interval sharpness but the provided details lack full methods, benchmark specifications, statistical validation (e.g., error bars or significance tests), and verification that the univariate post-hoc tuning preserves predictive accuracy. These are load-bearing for the central claim of outperforming uncertainty-aware baselines without degradation.
Authors: We accept that the experimental section requires greater detail to substantiate the claims. The revised manuscript will expand the methods with complete benchmark specifications (datasets, sizes, preprocessing, and evaluation protocols). We will add error bars from repeated runs, statistical significance tests against baselines, and explicit tables comparing predictive accuracy (e.g., RMSE/MAE) of the original deterministic model, the uncalibrated stochastic ensembles, and the post-hoc calibrated version to confirm that tuning does not degrade core performance. revision: yes
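A minimal sketch of the comparison promised in these responses, assuming access to the deterministic predictions, the stochastic ensemble, and the targets (names and metrics are illustrative, not the authors' evaluation code):

```python
import numpy as np

def ensemble_mean_check(det_pred, ensemble_preds, y_true):
    """Quantify the bias of the ensemble mean relative to the deterministic output,
    and its effect on RMSE (illustrative sketch only)."""
    ens_mean = ensemble_preds.mean(axis=0)
    rel_gap = np.linalg.norm(ens_mean - det_pred) / np.linalg.norm(det_pred)
    rmse_det = np.sqrt(np.mean((det_pred - y_true) ** 2))
    rmse_ens = np.sqrt(np.mean((ens_mean - y_true) ** 2))
    return {"relative_gap": rel_gap,
            "rmse_deterministic": rmse_det,
            "rmse_ensemble_mean": rmse_ens}
```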
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper proposes Stochastic Attention as an inference-time modification replacing softmax with normalized multinomial samples controlled by one concentration parameter, then introduces an explicit post-hoc calibration objective to tune that parameter by matching outputs to targets. Claims rest on empirical benchmark comparisons against baselines rather than any first-principles derivation or mathematical reduction. No equations, uniqueness theorems, or self-citations appear in the provided text that would create a self-definitional loop or force a result by construction. The mechanism is defined independently before the univariate tuning step, and performance metrics are reported after evaluation, not as tautological outputs of the fit itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- concentration parameter (ν)
axioms (1)
- Domain assumption: replacing softmax attention weights with normalized multinomial samples produces valid attention distributions that yield meaningful predictive ensembles.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "Stochastic attention replaces that exact expectation by a finite-sample approximation drawn from the same categorical distribution... indexed by a concentration parameter ν... calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
The relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "E[π̃_t | π_t] = π_t, E[õ_t | π_t, V] = o_t ... Cov(π̃_t | π_t) = (1/ν)(diag(π_t) − π_t π_tᵀ)"
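The quoted moment conditions are consistent with standard properties of normalized multinomial counts. A brief restatement, under our reading that the sampled row is π̃_t = m_t/ν with m_t ~ Multinomial(ν, π_t) (an assumption; the paper's exact construction is not reproduced here):

```latex
% Our reading of the quoted moments (assumption: \tilde{\pi}_t = m_t/\nu,
% m_t \sim \mathrm{Multinomial}(\nu, \pi_t)); requires amsmath.
\begin{align}
  \mathbb{E}[\tilde{\pi}_t \mid \pi_t] &= \pi_t,
  \qquad
  \mathbb{E}[\tilde{o}_t \mid \pi_t, V]
    = \mathbb{E}[\tilde{\pi}_t \mid \pi_t]^{\top} V
    = \pi_t^{\top} V = o_t, \\
  \operatorname{Cov}(\tilde{\pi}_t \mid \pi_t)
  &= \frac{1}{\nu}\bigl(\operatorname{diag}(\pi_t) - \pi_t \pi_t^{\top}\bigr),
\end{align}
% so the attention output is unbiased in expectation and the sampling noise shrinks as \nu grows.
```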
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Cong, B., Daheim, N., Shen, Y., Cremers, D., Yokota, R., Khan, M. E., and Möllenhoff, T. Variational low-rank adaptation using IVON. In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024. URL https://openreview.net/forum?id=nRD5uZa2fe
- [2] Das, A., Kong, W., Sen, R., and Zhou, Y. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=jn2iTJas6h
- [3] Dawid, A. P. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General), 147(2): 278--290, 1984. doi:10.2307/2981683. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.2307/2981683
- [4] Fan, X., Zhang, S., Tanwisuth, K., Qian, X., and Zhou, M. Contextual dropout: An efficient sample-dependent dropout module. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=ct8_a9h1M
- [5] Fan, X., Zhang, S., Tanwisuth, K., Qian, X., and Zhou, M. Contextual dropout: An efficient sample-dependent dropout module. In International Conference on Learning Representations, 2021b. URL https://openreview.net/forum?id=ct8_a9h1M
- [6] Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050--1059, New York, New York, USA, 20--22 Jun 2016. PMLR. URL htt...
- [7] Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2): 243--268, 2007. ISSN 1369-7412. doi:10.1111/j.1467-9868.2007.00587.x. URL https://doi.org/10.1111/j.1467-9868.2007.00587.x. Originally presented at Workshop on Ensem...
- [8] Gruber, S. and Buettner, F. Better uncertainty calibration via proper scores for classification and beyond. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 8618--8632. Curran Associates, Inc., 2022.
- [9] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1321--1330. PMLR, 06--11 Aug 2017. URL https://proceedings.mlr.press/v70/guo17a.html
- [10] Heo, J., Lee, H. B., Kim, S., Lee, J., Kim, K. J., Yang, E., and Hwang, S. J. Uncertainty-aware attention for reliable interpretation and prediction. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://pro...
- [11] Hu, E. J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [12] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pp. 876--885, Monterey, California, USA, August 2018. AUAI Press.
- [13] Kuleshov, V. and Deshpande, S. Calibrated and sharp uncertainties in deep learning via density estimation. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 11683--11693. PMLR, 17--23 Jul...
- [14] Kuleshov, V., Fenner, N., and Ermon, S. Accurate uncertainties for deep learning using calibrated regression. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2796--2804. PMLR, 10--15 Jul 2018. URL https://proceedings.mlr.press/v80/kuleshov18a.html
- [15] Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neu...
- [16] Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, volume 32, pp. 13153--13164. Curran Associates, Inc., 2019.
- [17] Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K., and Grover, A. ClimaX: A foundation model for weather and climate. In ICML, pp. 25904--25938, 2023.
- [18] Onal, E., Flöge, K., Caldwell, E., Sheverdin, A., and Fortuin, V. Gaussian stochastic weight averaging for Bayesian low-rank adaptation of large language models. In Sixth Symposium on Advances in Approximate Bayesian Inference - Non Archival Track, 2024. URL https://openreview.net/forum?id=LZrCBQBCzl
- [19] Papamarkou, T., Skoularidou, M., Palla, K., Aitchison, L., Arbel, J., Dunson, D., Filippone, M., Fortuin, V., Hennig, P., Hernández-Lobato, J. M., Hubin, A., Immer, A., Karaletsos, T., Khan, M. E., Kristiadi, A., Li, Y., Mandt, S., Nemeth, C., Osborne, M. A., Rudner, T. G. J., Rügamer, D., Teh, Y. W., Welling, M., Wilson, A. G., and Zhang, R. Posi...
- [20] Pei, J., Wang, C., and Szarvas, G. Transformer uncertainty estimation with hierarchical stochastic attention. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10): 11147--11155, 2022. doi:10.1609/aaai.v36i10.21364. URL https://ojs.aaai.org/index.php/AAAI/article/view/21364
- [21] Shen, Y., Daheim, N., Cong, B., Nickl, P., Marconi, G. M., Raoul, B. C. E. M., Yokota, R., Gurevych, I., Cremers, D., Khan, M. E., and Möllenhoff, T. Variational learning is effective for large deep networks. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 44665--446...
- [22] Vovk, V., Gammerman, A., and Shafer, G. Algorithmic Learning in a Random World. Springer, Cham, first edition, 2005. ISBN 9783031066481. URL https://link.springer.com/book/10.1007/978-3-031-06649-8
- [23] Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems, volume 33, pp. 4697--4708. Curran Associates, Inc., 2020.
- [24] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2048--2057, Lille, Fr...
- [25] Yadav, A. and Zhang, R. Bayesian optimization under uncertainty for training a scale parameter in stochastic models, 2025.
- [26] Ye, F., Yang, M., Pang, J., Wang, L., Wong, D. F., Yilmaz, E., Shi, S., and Tu, Z. Benchmarking LLMs via uncertainty quantification. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 15356--15385. Curran Associates, Inc., 2024. doi:10.52202...
- [27] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12): 11106--11115, 2021. doi:10.1609/aaai.v35i12.17325. URL https://ojs.aaai.org/index.php/AAAI/article/view/17325
discussion (0)