pith. machine review for the scientific record.

arxiv: 2604.20472 · v1 · submitted 2026-04-22 · 💻 cs.RO · cs.LG

Recognition: unknown

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

Authors on Pith no claims yet

Pith reviewed 2026-05-09 23:56 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords vision-language-action models · uncertainty calibration · temporal difference learning · sequential tasks · Brier score · value function · robotics

The pith

The sequential Brier score risk minimizer equals the value function of a VLA policy for binary episodic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a direct link between uncertainty calibration and reinforcement learning value functions in sequential decision tasks. It extends the Brier score to handle partial trajectories in episodic settings where success is known only at the end. For binary outcomes this extension makes the optimal calibration target identical to the policy's value function, so temporal difference learning can be used to adjust confidence estimates over time. Experiments on simulated and real robot data show improved calibration and performance, and reveal that single-step action probabilities become reliable uncertainty sources under this calibration.

Core claim

We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. Empirically, TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data, and single-step action probabilities yield competitive uncertainty estimates.
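The equivalence is stated without its intermediate step. A sketch of the bridge under the paper's setting (history h_t, binary terminal success Y(h_T); this is a reconstruction from the abstract and figure captions, not the paper's proof) runs:

```latex
% Sequential Brier risk of a confidence predictor f at time t:
\mathrm{BS}_{\mathrm{seq}}(f, t) \;=\; \mathbb{E}_{\pi}\!\left[\bigl(f(h_t) - Y(h_T)\bigr)^{2}\right]
% Conditioning on h_t and minimizing the squared error pointwise gives
f^{*}(h_t) \;=\; \mathbb{E}_{\pi}\!\left[\,Y(h_T) \mid h_t\,\right]
          \;=\; \Pr_{\pi}\!\left(Y(h_T) = 1 \mid h_t\right)
          \;=\; V^{\pi}(h_t)
% i.e. the value function under an undiscounted 0/1 terminal reward.
```

Because the squared error is a proper scoring rule, its pointwise minimizer is the conditional expectation of the outcome; for a binary outcome that expectation is the success probability given the history, which is the policy's value function. This is why TD estimation of V^π doubles as calibration.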

What carries the argument

Sequential Brier score whose risk minimizer equals the policy value function, allowing TD estimation to serve as calibration.
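As a concrete illustration (our sketch, not the paper's code; all names are hypothetical), the sequential Brier score can be computed from rollouts as the squared gap between per-step confidences and the terminal 0/1 outcome, reported at time quantiles so that rollouts of different lengths are comparable, as in the paper's figures:

```python
import numpy as np

def sequential_brier(confidences, successes, quantiles=(0.25, 0.5, 0.75, 1.0)):
    """Sequential Brier score reported at fixed time quantiles.

    confidences: list of 1-D arrays, per-step success confidences f(h_t)
    successes:   list of 0/1 terminal outcomes Y(h_T), one per rollout
    Lower is better; quantile reporting makes rollouts of different
    lengths comparable.
    """
    scores = {}
    for q in quantiles:
        errs = []
        for f, y in zip(confidences, successes):
            # Index of the step at time-quantile q of this rollout.
            t = max(min(int(np.ceil(q * len(f))) - 1, len(f) - 1), 0)
            errs.append((f[t] - y) ** 2)
        scores[q] = float(np.mean(errs))
    return scores

# Two toy rollouts: a success with rising confidence, a failure with falling one.
rollouts = [np.linspace(0.5, 0.9, 10), np.linspace(0.5, 0.4, 6)]
outcomes = [1, 0]
print(sequential_brier(rollouts, outcomes))
```

In this toy example the score shrinks toward the end of the rollout, as confidences sharpen toward the true outcome, which is the qualitative pattern the paper's benchmark figures report.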

If this is right

  • TD calibration improves uncertainty estimates and task performance on both simulated and real-robot VLA data.
  • Single-step action probabilities from a TD-calibrated VLA become competitive sources of uncertainty.
  • Calibration and value estimation can be treated as the same optimization problem in binary sequential settings.
  • Partial trajectories suffice for calibration because the sequential risk minimizer matches the value function.
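If the risk minimizer really is the value function, ordinary TD(0) is a calibration procedure. A minimal tabular sketch (our toy construction, not the paper's training setup) on a two-step chain whose terminal success probability is 0.8:

```python
import random

def td0_chain(n_episodes=20000, alpha=0.05, p_success=0.8, seed=0):
    """Tabular TD(0) on a toy 2-step chain A -> B -> terminal 0/1 outcome.

    With an undiscounted 0/1 terminal reward, the TD fixed point is
    V(s) = P(success | s): exactly the calibrated confidence that the
    sequential Brier score asks for, learned from partial trajectories.
    """
    rng = random.Random(seed)
    V = {"A": 0.5, "B": 0.5}
    for _ in range(n_episodes):
        y = 1.0 if rng.random() < p_success else 0.0
        # TD(0): bootstrap A's confidence from B's current estimate,
        # then move B's estimate toward the observed terminal outcome.
        V["A"] += alpha * (V["B"] - V["A"])
        V["B"] += alpha * (y - V["B"])
    return V

values = td0_chain()
print(values)  # both estimates settle near p_success = 0.8
```

The bootstrapped update on state A never sees the terminal outcome directly, which is the sense in which partial trajectories suffice for calibration.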

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same TD mechanism could be tested on non-episodic or non-binary tasks to see whether the equivalence generalizes.
  • Other calibration losses might admit similar value-function interpretations, opening a route to RL-style calibration in broader sequential models.
  • If the equivalence holds, existing value-function approximators in robotics could be repurposed directly as calibrated uncertainty models without extra heads.

Load-bearing premise

Tasks are episodic and outcomes are binary success signals observed only at episode termination.

What would settle it

A binary episodic task where optimizing the sequential Brier score produces probabilities that differ from the value function computed by TD learning, or where TD calibration shows no improvement over non-TD baselines.

Figures

Figures reproduced from arXiv: 2604.20472 by Aviv Tamar, Mirco Mutti, Shelly Francis-Meretzki, Yaniv Romano.

Figure 1. Sequential Brier scores across benchmarks. Sequential Brier score (lower is better) on an unseen validation set, averaged over 21 random seeds (train/validation task splits). To compare calibration across rollouts of different lengths, the Brier score is reported over time quantiles; each subplot corresponds to a (VLA model, benchmark) pair. … view at source ↗
Figure 2. Two-step MDP from Example 4.1. view at source ↗
Figure 3. ROC-AUC vs Brier score over all learned baselines in all benchmarks at the minimum rollout length. Points are grouped by method and split, with a dashed linear fit; the Spearman correlation ρ = −0.686 indicates a strong negative correlation. view at source ↗
Figure 4. RNN methods with top-10 probabilities gain significant improvement and outperform the baseline VLA policy, validating the learned scoring function (fθ) for action selection. Notably, the RNN method trained with TDQC achieves the highest overall performance, reaching a 55% average success rate, an improvement of 13% over the regular baseline. … view at source ↗
Figure 5. Extended analysis of guided action search and TDQC efficiency. RNN-TDQC provides the highest success rates, while the Threshold 0.35 variant offers a significant reduction in computational overhead by triggering action search only when safety is at risk, while maintaining high success rates. … view at source ↗
Figure 6. Failures and successes detected by RNN-TDQC (top-10 probabilities) align with the actual robot failures, as shown in observations from the OpenVLA + LIBERO-10 simulation. The green-shaded areas show the functional CP band; once failure scores exceed the band, a failure flag is raised. view at source ↗
Figure 7. Successful rollout with informative failure scores of TDQC top-10 probabilities on the OpenVLA LIBERO-10 benchmark. Task: "put both the alphabet soup and the tomato sauce in the basket". The failure score rises when the policy becomes temporarily stuck while trying to drop the alphabet soup into the basket around step 140, decreases after recovery around step 275, and increases again when t… view at source ↗
Figure 9. Ablation results for TD methods. Overall, TD-0 with top-10 probabilities achieves the best performance. view at source ↗
Figure 8. Comparison between BCE and TD-based methods for the same rollout as in … view at source ↗
Figure 10. Additional failure detection analysis using thresholds obtained by functional CP. The plots show TPR (true positive rate, left column) and FPR (false positive rate, right column) against the significance level α for each evaluation benchmark, averaged across 21 seeds. view at source ↗
Figure 11. Analysis of VLA calibration and success rates. (a-f) Scatter plots showing the strong negative correlation between Brier score at stop time (T̂) and ROC-AUC across different model-benchmark pairs. view at source ↗
Figure 12. Brier scores (val-seen). Sequential Brier score (lower is better) on the seen validation set, averaged over 21 random seeds and all environments, reported at different time quantiles; each subplot corresponds to a (model, benchmark) pair. For π0, action probabilities are not directly interpretable, hence probability-based TDQC variants are not reported. … view at source ↗
Figure 13. Brier scores (val-unseen). Sequential Brier score (lower is better) on the unseen validation set, averaged over 21 random seeds, reported at different time quantiles; each subplot corresponds to a (model, benchmark) pair. For π0, action probabilities are not directly interpretable, hence probability-based TDQC variants are not reported. Across all settings, our TD-based methods consistently … view at source ↗
Figure 14. ECE scores (val-seen). ECE scores (lower is better) on the seen validation set, averaged over 21 random seeds and all environments, reported at different time quantiles; each subplot corresponds to a (model, benchmark) pair. Lower Brier scores correlate with lower ECE scores in almost all settings, reflecting the Brier score decomposition in Section 2.1. view at source ↗
Figure 15. ECE scores (val-unseen). ECE scores (lower is better) on the unseen validation set, averaged over 21 random seeds and all environments, reported at different time quantiles; each subplot corresponds to a (model, benchmark) pair. Lower Brier scores correlate with lower ECE scores in almost all settings, reflecting the Brier score decomposition in Section 2.1. view at source ↗
Figure 16. ECE vs Brier scores. ECE scores compared to sequential Brier scores across all models and benchmarks. view at source ↗
read the original abstract

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a sequential extension of the Brier score for calibrating uncertainty estimates produced along trajectories in episodic vision-language-action (VLA) tasks, where success is binary and observed only at termination. It claims that the risk minimizer of this score coincides with the policy's value function, thereby justifying the use of temporal-difference (TD) value estimation as a calibration method. Empirical results on simulated and real-robot data show performance gains over prior calibration techniques, and that TD-calibrated single-step action probabilities become competitive for uncertainty quantification.

Significance. If the claimed equivalence holds, the work supplies a direct theoretical link between proper scoring rules and RL value functions that is specific to binary episodic settings; this is a clean bridge with potential utility for uncertainty-aware robotics. The empirical demonstration on real-robot data is a concrete strength, as is the observation that TD calibration can rehabilitate single-step probabilities. The result is internally consistent within its stated scope but does not claim generality beyond episodic binary success.

major comments (2)
  1. [theoretical development (abstract and §3)] The central claim that the sequential Brier score's risk minimizer equals the value function is load-bearing for the entire contribution, yet the manuscript supplies neither the explicit derivation steps nor the intermediate equations showing that the unique minimizer is the conditional expectation of the terminal outcome given the history. This omission leaves the mathematical bridge without verifiable support.
  2. [§5] Experiments: performance gains are reported relative to the state of the art, but the text does not describe experimental controls such as matched training budgets, hyperparameter search effort, or whether TD calibration is applied on-policy versus off-policy in a manner comparable to the baselines. Without these, the empirical superiority claim cannot be assessed.
minor comments (2)
  1. [§3] The definition of the sequential Brier score should be stated as an explicit equation (with a numbered label) rather than described only in prose, to allow readers to verify the risk-minimizer argument directly.
  2. [§5] Results tables or figures should include error bars or statistical significance indicators for the reported improvements on both simulated and real-robot data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We appreciate the recognition of the theoretical link between proper scoring rules and value functions in episodic binary settings, as well as the strengths identified in the empirical evaluation on real-robot data. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and expansions.

read point-by-point responses
  1. Referee: [theoretical development (abstract and §3)] The central claim that the sequential Brier score's risk minimizer equals the value function is load-bearing for the entire contribution, yet the manuscript supplies neither the explicit derivation steps nor the intermediate equations showing that the unique minimizer is the conditional expectation of the terminal outcome given the history. This omission leaves the mathematical bridge without verifiable support.

    Authors: We agree that the derivation requires more explicit steps for verifiability. In the revised manuscript, we will expand Section 3 (and the abstract if needed for clarity) to include the complete proof. The argument proceeds by first writing the expected sequential Brier score as an expectation over full trajectories, then showing via the law of total expectation and the properness of the Brier score that the unique minimizer at each history is the conditional probability of eventual success given that history—which is exactly the policy value function in this binary episodic setting. All intermediate equations will be provided, along with a note on uniqueness for binary outcomes. revision: yes

  2. Referee: [§5] Experiments: performance gains are reported relative to the state of the art, but the text does not describe experimental controls such as matched training budgets, hyperparameter search effort, or whether TD calibration is applied on-policy versus off-policy in a manner comparable to the baselines. Without these, the empirical superiority claim cannot be assessed.

    Authors: We concur that additional experimental details are necessary for reproducibility and fair assessment. In the revised Section 5, we will insert a dedicated paragraph (or subsection) specifying: (i) matched training budgets and total environment steps across all methods, (ii) the hyperparameter search protocol (including ranges, number of trials, and selection metric), and (iii) confirmation that TD calibration is applied off-policy on the same rollout data used by the baselines, with on-policy variants also reported for completeness where relevant. Any differences in compute will be noted. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The central claim is a direct mathematical consequence of defining the sequential Brier score as the expected squared deviation from the terminal binary outcome: its unique risk minimizer is necessarily the conditional expectation of that outcome, which is exactly the policy value function in the episodic setting. The paper introduces the sequential Brier extension explicitly and then states the equivalence as a derived property rather than an assumption or fit. TD estimation is then the standard off-policy method for estimating this quantity. No load-bearing step reduces to a self-citation, a fitted parameter renamed as prediction, or an ansatz smuggled from prior work; the derivation is self-contained and holds under the stated episodic binary-success assumptions without tautology.
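The rationale's key step, that the squared error's unique minimizer is the conditional success frequency, can be checked numerically. A toy check (ours, with hypothetical data, not from the paper):

```python
import numpy as np

# 1,000 binary outcomes with a hypothetical true success probability of 0.7.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.7).astype(float)

# Brier risk of each constant prediction p, evaluated on a fine grid.
grid = np.linspace(0.0, 1.0, 1001)
risks = np.array([np.mean((p - y) ** 2) for p in grid])
best_p = float(grid[risks.argmin()])

# The grid minimizer coincides (up to grid resolution) with the empirical
# success frequency, i.e. the conditional expectation of the outcome.
print(best_p, float(y.mean()))
```

Replacing the constant prediction with a history-conditioned one turns the same argument into the value-function equivalence: the minimizer at each history is that history's success probability.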

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of episodic binary-success tasks and the applicability of standard TD value estimation; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Episodic tasks with binary success outcomes determined at the end of the episode.
    This setting is required for the sequential Brier risk minimizer to coincide with the value function.

pith-pipeline@v0.9.0 · 5476 in / 1109 out tokens · 53644 ms · 2026-05-09T23:56:21.370237+00:00 · methodology

