Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

Gabriel Wagner; Ivan Ruchkin; Jiawen Wu; Zhenjiang Mao; Zhongzheng Zhang

arxiv: 2605.21109 · v1 · pith:ASKSN6JWnew · submitted 2026-05-20 · 💻 cs.RO

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin

Pith reviewed 2026-05-21 04:21 UTC · model grok-4.3

classification 💻 cs.RO

keywords: anomaly detection · confidence calibration · vision-based control · distribution shifts · epistemic uncertainty · autonomous racing · online calibration · safety prediction

The pith

Fusing perceptual reconstruction errors with dynamics uncertainty scores calibrates vision-based safety predictions under unseen distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing anomaly scores for vision-based controllers overlook dynamics problems such as actuation bias or latency, even when camera images appear normal. It proposes an online calibration method that combines a perceptual score based on reconstruction error with a dynamics score drawn from epistemic uncertainty and control statistics. This fused signal drives a lightweight temperature-scaling step that lowers overconfidence only when anomalies are detected. A sympathetic reader would care because better-calibrated safety predictions could prevent over-trust in controllers facing real-world changes without requiring any model retraining.

Core claim

Anomaly-Informed Online Calibration fuses a perceptual anomaly score from reconstruction error with a dynamics anomaly score from epistemic uncertainty and control-stream statistics, both extracted from a world model. Based on the fused scores, a temperature-scaling calibrator uses test-time augmentation to reduce overconfidence selectively under shift while preserving nominal-condition performance. On a physical DonkeyCar tested under four real-world anomaly protocols never seen in training (darkness, blur, actuation bias, processing latency), the method lowers average expected calibration error from 0.184 to 0.116.
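A minimal sketch of anomaly-conditioned temperature scaling for a binary safety probability, under stated assumptions. The linear map `anomaly_temperature` and its weights `w_rho` and `w_delta` are hypothetical; the source does not give the paper's actual mapping from the two scores to the temperature. The sketch only illustrates the intended behaviour: with both scores at zero the temperature is 1 and the prediction passes through unchanged, while a raised score pulls the confidence toward 0.5.

```python
import math


def logit(p, eps=1e-6):
    """Inverse sigmoid, with clipping so that p = 0 or p = 1 stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def anomaly_temperature(rho, delta, w_rho=2.0, w_delta=2.0):
    """Hypothetical map from anomaly scores in [0, 1] to a temperature T >= 1.

    The linear form and the weights are illustrative assumptions,
    not the paper's calibrator.
    """
    return 1.0 + w_rho * rho + w_delta * delta


def calibrate(p_bar, rho, delta):
    """Temperature-scale a safety probability: sigmoid(logit(p_bar) / T)."""
    T = anomaly_temperature(rho, delta)
    return sigmoid(logit(p_bar) / T)


nominal = calibrate(0.95, rho=0.0, delta=0.0)  # T = 1, confidence unchanged
shifted = calibrate(0.95, rho=0.0, delta=1.0)  # T = 3, confidence softened
print(f"nominal={nominal:.3f} shifted={shifted:.3f}")
```

Dividing the logit by T rather than rescaling the probability keeps the predicted class fixed and changes only how confident the prediction is.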

What carries the argument

Anomaly-Informed Online Calibration, which fuses perceptual reconstruction error and dynamics epistemic uncertainty from a world model to adjust predictor temperature at test time.
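The test-time augmentation step can be sketched as averaging the safety predictor's output over several perturbed copies of the current image. Everything concrete here is a stand-in: `toy_predictor`, the brightness augmentations, and the flat list-of-pixels image are hypothetical placeholders for the paper's predictor and augmentation set, which the source does not specify.

```python
import math


def tta_average(predictor, image, augmentations):
    """Mean predicted safety probability over augmented copies of one image."""
    preds = [predictor(aug(image)) for aug in augmentations]
    return sum(preds) / len(preds)


def scale_brightness(factor):
    """Return an augmentation that scales pixel intensities, clipped to [0, 1]."""
    def aug(image):
        return [min(1.0, max(0.0, px * factor)) for px in image]
    return aug


def toy_predictor(image):
    """Placeholder for the safety predictor: brighter images score as safer."""
    mean_px = sum(image) / len(image)
    return 1.0 / (1.0 + math.exp(-8.0 * (mean_px - 0.5)))


image = [0.6, 0.7, 0.8, 0.7]  # toy 4-pixel "image", illustrative only
augs = [scale_brightness(f) for f in (0.8, 1.0, 1.2)]
p_single = toy_predictor(image)
p_bar = tta_average(toy_predictor, image, augs)
print(f"single={p_single:.3f} tta={p_bar:.3f}")
```

The averaged value always lies between the smallest and largest per-augmentation prediction, so a predictor that is sensitive to small perturbations ends up with a less extreme confidence.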

If this is right

Vision-based safety predictors can receive reliable confidence estimates without any component being retrained when new anomalies appear.
The same fusion of perceptual and dynamics scores applies directly to other physical platforms that use camera images for control.
Nominal performance stays intact because the calibrator acts only when the fused anomaly signal rises above baseline levels.
Four specific anomaly types—darkness, blur, actuation bias, and processing latency—are each handled by the same online procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The perception-dynamics gap identified here may appear in other control domains such as drone navigation or robotic manipulation whenever visual inputs remain plausible while physical behavior degrades.
Replacing the world model with a learned dynamics predictor trained on more diverse shifts could extend the method to environments where the current model becomes unreliable.
The selective nature of the calibration suggests it could be combined with uncertainty-aware planning to trigger safer fallback behaviors only when both perception and dynamics signals agree.

Load-bearing premise

The world model that supplies the dynamics anomaly score remains accurate and does not itself produce misleading signals when the input distribution shifts.

What would settle it

Rerun the same four anomaly protocols on the DonkeyCar and check whether the fused score still yields a lower expected calibration error than the best baseline. Failing to reproduce an average ECE near 0.116, or performing worse than the baseline, would falsify the improvement claim.
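To make the metric in this test concrete, here is a minimal sketch of expected calibration error for a binary safety predictor, using equal-width bins over the predicted probability of the safe class. The bin count and this particular binary variant are assumptions; the source does not state which ECE definition or binning the paper uses.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed frequency.

    probs:  predicted probabilities of the positive ("safe") class, in [0, 1]
    labels: observed outcomes, 1 for safe and 0 for unsafe
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - frequency)
    return ece


# Overconfident toy predictor: claims 0.9 but is right only half the time.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```

A value of 0 means predicted probabilities match observed frequencies in every occupied bin; overconfidence shows up as a positive gap concentrated in the high-probability bins.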

Figures

Figures reproduced from arXiv: 2605.21109 by Gabriel Wagner, Ivan Ruchkin, Jiawen Wu, Zhenjiang Mao, Zhongzheng Zhang.

**Figure 2.** Images of in-distribution data and four anomalies on…

**Figure 3.** Approach overview. At each time step $i$, the world model extracts a perception score $\rho_i$ and a dynamics score $\delta_i$ from its internal inference errors. The safety predictor $g$ is evaluated under test-time augmentation to produce a TTA-averaged prediction $\bar{p}_{i+k}$, which is then calibrated via anomaly-conditioned temperature scaling to yield the final confidence $\tilde{p}_{i+k}$.

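**Figure 4.** Baseline comparison: ECE (↓) vs. prediction horizon K under four OOD protocols. Our method (TTA+ρ+δ) consistently achieves the lowest calibration error.

… perception score (Eq. 3):

$$\delta_i = \operatorname{clip}_{[0,1]}\!\left(\frac{d_{\mathrm{Maha}}\big(f_i^{K_{\max}}, \mu_{\mathrm{in}}, \Sigma_{\mathrm{in}}\big) - \mu_\delta}{\alpha\,\sigma_\delta}\right), \qquad \alpha = 2, \tag{4}$$

where $\mu_{\mathrm{in}}$ and $\Sigma_{\mathrm{in}}$ are the sample mean and covariance of $\{f_i^{K_{\max}}\}_{i \in D^{\mathrm{cal}}_{\mathrm{in}}}$, and $\mu_\delta$, $\sigma_\delta$ are the mean and standard deviation of the Mahalanobis distance on $D^{\mathrm{cal}}$…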

**Figure 5.** Alternative calibration metrics vs. prediction hori…

**Figure 6.** Reliability diagrams aggregated over all four OOD…
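Eq. (4), quoted in the Figure 4 excerpt, normalizes a Mahalanobis distance into a dynamics score in [0, 1]. A minimal sketch of that normalization follows, assuming the distances themselves have already been computed elsewhere. The helper names and the use of the population standard deviation are assumptions; the source only says "standard deviation".

```python
import statistics


def fit_distance_stats(cal_distances):
    """Mean and standard deviation of Mahalanobis distances on calibration data.

    Using the population standard deviation is an assumption; the source does
    not say whether the sample or population estimator is meant.
    """
    return statistics.fmean(cal_distances), statistics.pstdev(cal_distances)


def dynamics_score(d_maha, mu_delta, sigma_delta, alpha=2.0):
    """Eq. (4): clip (d - mu_delta) / (alpha * sigma_delta) to [0, 1]."""
    z = (d_maha - mu_delta) / (alpha * sigma_delta)
    return min(1.0, max(0.0, z))


# Toy calibration distances, purely illustrative.
mu, sigma = fit_distance_stats([1.0, 2.0, 3.0])
for d in (1.5, 2.0, 2.0 + sigma, 10.0):
    print(f"d={d:.2f} -> delta={dynamics_score(d, mu, sigma):.2f}")
```

With α = 2, any distance at or below the calibration mean maps to 0, and a distance two standard deviations above the mean saturates the score at 1.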

Original abstract

Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
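As a quick arithmetic check, the quoted 37% follows from the two reported ECE values when read as a relative reduction:

```latex
\[
\frac{0.184 - 0.116}{0.184} = \frac{0.068}{0.184} \approx 0.370
\]
```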

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper gets a 37% ECE drop on physical hardware by fusing perceptual reconstruction error with dynamics epistemic uncertainty for selective online calibration under real shifts.

The main thing to know is that this work reports a clear numerical gain on a real DonkeyCar by combining two anomaly signals to decide when to apply test-time temperature scaling. They point out that visual reconstruction error catches image corruptions but misses actuation bias and latency, where pictures stay plausible while control degrades. The fix is to add a dynamics score from epistemic uncertainty in a world model plus control-stream stats, fuse the two, and use that to trigger calibration only under shift while leaving nominal performance alone.

The four anomaly protocols (darkness, blur, actuation bias, latency) are all unseen in training, and they measure a drop from 0.184 to 0.116 average ECE against the best baseline. That physical validation is the strongest part; it moves beyond simulation and shows the idea can run without retraining the safety predictor. The approach stays empirical and avoids circular claims, which helps.

On the softer side, the abstract does not spell out the exact fusion weights or how the baselines were reimplemented, so it is hard to judge robustness or whether the 37% holds under different random seeds. The stress-test worry about the world model staying informative on actuation and latency shifts is reasonable to check, but the reported gains on those exact cases suggest the fused signal did trigger useful scaling in their runs. If the full methods show the dynamics score actually rises when needed, the central claim stands; otherwise the improvement could shrink.

This is useful for people working on deployable vision-based controllers who already have a world model and want a lightweight way to handle distribution shift at test time. It is not a theoretical advance but has enough concrete hardware evidence to deserve a serious referee who can press on the implementation details and statistical reporting.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Anomaly-Informed Online Calibration for vision-based safety prediction in autonomous racing. Without retraining, it fuses a perceptual anomaly score (reconstruction error) with a dynamics score (epistemic uncertainty plus control-stream statistics) extracted from a world model; the fused score then drives test-time temperature scaling to reduce overconfidence under distribution shift. On a physical DonkeyCar, the method lowers average expected calibration error from 0.184 to 0.116 (37% improvement) across four real-world anomaly protocols unseen in training: darkness, blur, actuation bias, and processing latency.

Significance. If the result holds, the work supplies a practical, training-free mechanism for closing the perception-dynamics gap in anomaly detection for safety-critical vision controllers. The hardware validation with multiple distinct, physically realized shifts is a concrete strength that increases relevance for deployment.

major comments (2)

[Abstract and §3 (Method)] The central claim that the fused anomaly score correctly triggers calibration under actuation bias and processing latency rests on the dynamics component (epistemic uncertainty and control-stream statistics) increasing meaningfully when images remain visually plausible. Because the world model is trained only on nominal trajectories, it is unclear whether epistemic uncertainty rises under these shifts; if it remains low or miscalibrated, the fusion under-detects the anomaly and the reported ECE reduction cannot be attributed to the proposed method. Please add an ablation or per-anomaly breakdown of the dynamics score values and their correlation with calibration improvement.
[Experiments section] The table or figure reporting the 0.184-to-0.116 ECE reduction (and the 37% figure) does not state the exact baseline implementations, the number of runs, or any statistical test for the improvement. Without these details, it is impossible to judge whether the gain is robust or reproducible, which directly affects the strength of the empirical contribution.

minor comments (1)

[§3] Define the precise fusion rule (e.g., weighted sum, product, or learned combination) for the perceptual and dynamics scores and state how the temperature is selected from the test-time augmentation ensemble.
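To illustrate why the choice of fusion rule matters, the sketch below compares three candidate rules on a dynamics-only anomaly, where the perceptual score stays near zero. All three functions and the equal weights are hypothetical, and noisy-OR is swapped in here as a simple stand-in for a learned combination; the source does not say which rule the paper uses.

```python
def fuse_weighted_sum(rho, delta, w=0.5):
    """Convex combination of the two scores."""
    return w * rho + (1.0 - w) * delta


def fuse_product(rho, delta):
    """Product: high only when both scores are high."""
    return rho * delta


def fuse_noisy_or(rho, delta):
    """Noisy-OR: high when either score is high."""
    return 1.0 - (1.0 - rho) * (1.0 - delta)


# Dynamics-only anomaly: images look normal (rho = 0), dynamics score is raised.
rho, delta = 0.0, 0.8
for rule in (fuse_weighted_sum, fuse_product, fuse_noisy_or):
    print(f"{rule.__name__}: {rule(rho, delta):.2f}")
```

Under these toy inputs a product rule returns zero, which would suppress exactly the actuation-bias and latency cases the paper targets, so the precise rule is material to the central claim.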

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise valid points about clarifying the contribution of the dynamics anomaly component and improving the reporting of experimental details for reproducibility. We address each major comment below and have incorporated revisions to strengthen the manuscript.

Referee: [Abstract and §3 (Method)] The central claim that the fused anomaly score correctly triggers calibration under actuation bias and processing latency rests on the dynamics component (epistemic uncertainty and control-stream statistics) increasing meaningfully when images remain visually plausible. Please add an ablation or per-anomaly breakdown of the dynamics score values and their correlation with calibration improvement.

Authors: We agree that explicit evidence for the dynamics score's behavior under non-visual shifts is important to substantiate the fusion mechanism. The world model, trained exclusively on nominal trajectories, produces elevated epistemic uncertainty when control inputs lead to trajectory deviations that are inconsistent with learned dynamics, even if the corresponding images appear plausible. In the revised manuscript we have added a per-anomaly breakdown (new Table 3 and accompanying text in §4.3) that reports mean dynamics scores for each of the four anomaly protocols together with their Pearson correlation to the observed per-anomaly ECE reductions. The added analysis shows that the dynamics score rises substantially for actuation bias and latency (while the perceptual score remains near nominal levels), and that this increase accounts for the majority of the calibration gain in those cases. We believe this directly addresses the concern and strengthens the attribution of the reported ECE improvement to the proposed method.

Revision: yes

Referee: [Experiments section] The table or figure reporting the 0.184-to-0.116 ECE reduction (and the 37% figure) does not state the exact baseline implementations, the number of runs, or any statistical test for the improvement.

Authors: We acknowledge that the original presentation omitted several details required for full reproducibility assessment. In the revised version we have expanded the caption of the primary results table (Table 2) and the corresponding paragraph in §4.2 to (i) list the precise baseline implementations (standard temperature scaling, entropy-based scaling, and Monte-Carlo dropout calibration, each applied without anomaly information), (ii) state that all metrics are averaged over 5 independent physical runs with standard deviation reported, and (iii) include a paired t-test confirming that the ECE reduction is statistically significant (p < 0.05). These additions are now present in both the main text and the supplementary material.

Revision: yes
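The paired t-test mentioned in this response can be sketched as follows for five paired runs. The ECE values are made-up placeholders chosen for easy arithmetic, not results from the paper, and the constant 2.776 is the standard two-sided 5% critical value of the t distribution with 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(baseline, method):
    """t statistic for paired samples: mean(diff) / (stdev(diff) / sqrt(n))."""
    diffs = [b - m for b, m in zip(baseline, method)]
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder per-run ECE values (NOT from the paper), one pair per physical run.
baseline_ece = [0.30, 0.28, 0.32, 0.29, 0.31]
method_ece = [0.20, 0.21, 0.19, 0.22, 0.18]

t = paired_t_statistic(baseline_ece, method_ece)
T_CRIT_DF4 = 2.776  # two-sided critical value at the 5% level, 4 degrees of freedom
print(f"t = {t:.2f}, significant at 5%: {abs(t) > T_CRIT_DF4}")
```

Pairing matters here because each physical run is evaluated by both methods, so run-to-run variation in track conditions cancels in the per-run differences.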

Circularity Check

0 steps flagged

No significant circularity; empirical validation on physical hardware

The paper proposes an anomaly-informed online calibration method that fuses perceptual reconstruction error with dynamics epistemic uncertainty from a world model, then applies selective temperature scaling at test time. All performance claims (37% ECE reduction from 0.184 to 0.116) are obtained from direct measurement on a physical DonkeyCar under four unseen real-world anomaly protocols. No equations, fitted parameters, or self-citations are used to derive the reported improvement; the result is an external empirical outcome rather than a quantity defined by construction from the calibration procedure itself. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method depends on a pre-trained world model whose uncertainty estimates are treated as reliable indicators of dynamics anomalies; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (1)

Domain assumption: A world model trained on nominal data produces epistemic uncertainty that meaningfully signals dynamics anomalies such as actuation bias or latency.
Invoked when extracting the dynamics score from epistemic uncertainty and control-stream statistics.

Appendix A (excerpt): Training and Implementation Details

Table V lists the training hyperparameters for all pipeline components. The world model consists of a convolutional VAE operating on 64×64 images and a ConvLSTM predictor that autoregressively rolls out laten...
