arxiv: 2606.12587 · v1 · pith:QSNWRR6Vnew · submitted 2026-06-10 · 💻 cs.AI · cs.HC

Strategic Decision Support for AI Agents

Pith reviewed 2026-06-27 09:56 UTC · model grok-4.3

classification 💻 cs.AI cs.HC
keywords AI agents · decision support · threshold policy · online algorithm · missed-support error · randomized exploration · value of support

The pith

AI agents optimally decide when to seek support by thresholding a value-of-support score to control missed-support errors while minimizing calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes decision support around AI agents as the main actors, with humans and tools providing backup. It sets up an optimization that minimizes how often support is requested while bounding the chance that the agent proceeds alone on cases where support would have helped. At the population level this yields a simple threshold rule on a scalar value of support. An online procedure then learns the right threshold on the fly through randomized exploration, achieving the error bound without assuming any particular data distribution. The same structure is shown to cover information gathering, human collaboration, and tool-use settings.

Core claim

At the population level, the optimal policy is a threshold rule on the value of support. Building on this structure, an online algorithm adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. A calibration-on-the-fly method further reduces unnecessary support calls.

What carries the argument

The optimization problem that minimizes support usage subject to a bound on counterfactual missed-support error, whose solution is the threshold rule on the value of support.
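
Read literally, the constrained problem can be written down as follows. This is a sketch in assumed notation, not the paper's own statement: $X$ is the agent's context, $\pi(X) \in \{0,1\}$ indicates a support call, $B \in \{0,1\}$ is a latent indicator that support would materially improve the output, $\alpha$ is the target level, $\tau$ is the threshold, and $v(x)$ is taken to be $\Pr(B = 1 \mid X = x)$. The referee report below writes the target as ε, while the figure captions use α.

```latex
\begin{align*}
\min_{\pi:\,\mathcal{X}\to\{0,1\}} \quad & \Pr\bigl(\pi(X) = 1\bigr) \\
\text{s.t.} \quad & \Pr\bigl(\pi(X) = 0,\; B = 1\bigr) \le \alpha, \\
\text{candidate solution:} \quad & \pi^\star(x) = \mathbf{1}\{\, v(x) \ge \tau \,\},
  \qquad v(x) = \Pr(B = 1 \mid X = x).
\end{align*}
```

Under these assumptions the constraint equals $\mathbb{E}[(1 - \pi(X))\, v(X)]$, so the error budget is spent most cheaply by acting alone on the contexts with the smallest $v(x)$, which is what a threshold does. Ties at $\tau$ may require randomization.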

If this is right

  • The population optimum is exactly a threshold on the value of support.
  • Randomized exploration in the online algorithm achieves the target error bound without distributional assumptions.
  • Calibration-on-the-fly further trims excess support calls while preserving the error guarantee.
  • The same threshold structure applies uniformly to information gathering, human-AI collaboration, and tool-use problems.
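
To make the online claim concrete, here is a minimal sketch of an adaptive threshold with randomized exploration. It is not the paper's algorithm: the update rule is a plain stochastic-approximation step in the style of adaptive conformal inference, swapped in for illustration, and the function name `run_online_threshold`, the step size `eta`, the starting threshold 0.5, and the inverse-probability weighting are all assumptions. The sketch assumes the benefit label is observed only when support is actually called.

```python
import random
from typing import List, Sequence, Tuple


def run_online_threshold(
    scores: Sequence[float],
    benefits: Sequence[int],
    alpha: float = 0.1,
    mu: float = 0.2,
    eta: float = 0.05,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Return (actions, thresholds); action 1 means support was called.

    Hypothetical sketch: `alpha` is the target missed-support level,
    `mu` > 0 the exploration probability, `eta` an assumed step size.
    """
    rng = random.Random(seed)
    tau = 0.5  # assumed starting threshold
    actions: List[int] = []
    thresholds: List[float] = []
    for score, benefit in zip(scores, benefits):
        thresholds.append(tau)
        would_act_alone = score < tau
        explore = rng.random() < mu
        # Call support if the score clears the threshold or if we explore.
        actions.append(1 if (explore or not would_act_alone) else 0)
        # The benefit label is used only on exploration rounds where the
        # threshold rule would have acted alone; dividing by mu makes the
        # estimate unbiased for the rule's would-be miss on this round.
        if explore and would_act_alone:
            err_hat = benefit / mu
        else:
            err_hat = 0.0
        # Too many estimated misses lowers tau, so support is called more.
        tau -= eta * (err_hat - alpha)
    return actions, thresholds
```

A call such as `run_online_threshold(scores, benefits, alpha=0.1, mu=0.2)` returns the per-round decisions and the threshold path. Raising `mu` spends more support calls on exploration in exchange for less noisy error estimates.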

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could let agents run longer in deployment before human intervention is needed, provided the value-of-support score stays stable.
  • Extending the framework to settings where multiple agents can support one another would require only redefining the support action and its value.
  • Real-world logs of agent decisions and outcomes could be used to test whether the learned thresholds remain effective when the environment drifts.

Load-bearing premise

A scalar value of support can be defined and scored so that thresholding it reliably controls the counterfactual missed-support error.

What would settle it

A controlled experiment in one of the modeled scenarios where the adaptive threshold rule plus randomized exploration still lets the missed-support error exceed its target bound.

Figures

Figures reproduced from arXiv: 2606.12587 by George Pappas, Hamed Hassani, Shayan Kiyani, Sima Noorani.

Figure 1: Effect of strategic decision support oversight.
Figure 2: Our method invokes decision support substantially less often than an LLM-decides baseline, while matching its error rate. For each of four agentic applications: information gathering (DDXPlus), tool use (WikiSQL), human-in-the-loop planning (VirtualHome), and collaborative human–AI reasoning (MATH), all using Gemini-2.5-Flash, we report two pairs of bars. The left pair (solid) shows the cumulative support …
Figure 3: Cumulative missed-support error on all four tasks with Qwen-2.5-7B as the agent.
Figure 4: Cumulative support rate $\widehat{\mathrm{SR}}_T = \frac{1}{T}\sum_{t=1}^{T} a_t$ across all task–model pairs at matched missed-support error. We show the best-performing variant across both families, paired with its same-embedding counterpart from the other family. Full per-panel comparisons in Appendix B.1. Calibration-on-the-fly recovers from uninformative signals. The Representation family reliably reduces the support rate relative to LLM…
Figure 5: Cumulative support rate $\widehat{\mathrm{SR}}_T = \frac{1}{T}\sum_{t=1}^{T} a_t$ across all task–model pairs, showing every score variant. Rows are base agents, columns are tasks. All variants are run at the same target α as in …
Figure 6: Effect of the exploration probability µ. Base agent: GPT-4o-mini. Task: DDXPlus. Score: Anchored-Gemini. Left: cumulative missed-support error against the target α. Right: cumulative support rate, with the LLM-Decides baseline shown for reference. Larger µ tightens error control and yields smoother convergence but increases support usage, matching the dependence on µ in the slack term of Theorem 4.1.
Figure 7: Gemini-2.5-Flash on VirtualHome under two gain definitions. Columns are gain definitions: …
Figure 8: Score distributions along the online stream for Gemini-2.5-Flash, split by the latent benefit variable …
Figure 9: Cumulative missed-support error MSE(T) across all task–model pairs. Rows are base agents (Qwen-2.5-7B, Gemini-2.5-Flash, GPT-4o-mini), columns are tasks (DDXPlus, WikiSQL, VirtualHome, MATH). Each panel shows the running MSE for all score variants together with the LLM-Decides baseline and the target level α. All variants converge towards α regardless of the score family.
Figure 10: Score-input ablation. Base agent: Gemini-2.5-Flash. Task: DDXPlus. Score: Anchored-Gemini.

Original abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools becomes support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost–value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human–AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for strategic decision support in AI agent systems. It formulates an optimization problem minimizing support usage subject to a bound on counterfactual missed-support error (probability that the agent acts without support on instances where support would have improved output). It claims that the optimal policy is a threshold rule on a scalar 'value of support', develops an online algorithm that adaptively thresholds a score via randomized exploration to control the error without distributional assumptions, introduces a calibration-on-the-fly method, and instantiates the framework in information gathering, human-AI collaboration, and tool-use scenarios, with experiments showing reliable error control and reduced support usage.

Significance. If the optimality result and online guarantee hold, the work provides a principled, assumption-light method for managing support calls in agentic systems, addressing reliability concerns in a role-reversed setting. The population-level threshold structure and no-distributional-assumption online control would be notable strengths if rigorously derived, as would the unified modeling across scenarios.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'the optimal policy is a threshold rule on the value of support' for the constrained optimization min support-usage s.t. P(missed-support) ≤ ε requires an explicit derivation showing that a scalar v(x) exists whose level sets directly bound the counterfactual error probability independently of the policy. The abstract states the result but supplies no conditions, proof sketch, or argument why the improvement from support can be summarized by one dimension in general agentic settings (joint distribution over actions, support outcomes, and responses).
  2. [Abstract] Abstract: The online algorithm is claimed to 'adaptively threshold such a score and use randomized exploration to control missed-support error without distributional assumptions.' This guarantee is load-bearing for the contribution, yet the abstract provides no argument or sketch showing that the control is independent of score construction rather than reducing to a fitted quantity by construction. The paper must demonstrate why randomized exploration alone suffices across the modeled scenarios.
minor comments (2)
  1. [Abstract] Abstract: 'humans and tools becomes support mechanisms' contains a subject-verb agreement error ('becomes' should be 'become').
  2. [Abstract] Abstract: The phrase 'calibration-on-the-fly method that reduces unnecessary support calls online' is introduced without indicating how it interacts with the main threshold algorithm or whether it preserves the error guarantee.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. Both major comments concern the abstract's presentation of the core results. The full manuscript contains the derivations (Section 3 for the threshold policy and Section 4 for the online algorithm), but we agree the abstract can be strengthened with brief sketches and conditions. We will revise the abstract accordingly while preserving its length. No standing objections remain after these clarifications.

  1. Referee: [Abstract] Abstract: The central claim that 'the optimal policy is a threshold rule on the value of support' for the constrained optimization min support-usage s.t. P(missed-support) ≤ ε requires an explicit derivation showing that a scalar v(x) exists whose level sets directly bound the counterfactual error probability independently of the policy. The abstract states the result but supplies no conditions, proof sketch, or argument why the improvement from support can be summarized by one dimension in general agentic settings (joint distribution over actions, support outcomes, and responses).

    Authors: The manuscript derives this result in Section 3. We define the scalar value of support as v(x) = E[output improvement from support | x] minus any per-call cost, which is a one-dimensional summary of the relevant conditional expectation. Because the missed-support indicator is monotone in v(x), the population-level optimization admits a threshold policy on v(x) whose level sets directly control the counterfactual error probability independently of the specific policy form. The joint distribution is handled by taking the expectation over the relevant marginal. We will add a one-sentence sketch and the monotonicity condition to the abstract in revision.

    Revision: yes

  2. Referee: [Abstract] Abstract: The online algorithm is claimed to 'adaptively threshold such a score and use randomized exploration to control missed-support error without distributional assumptions.' This guarantee is load-bearing for the contribution, yet the abstract provides no argument or sketch showing that the control is independent of score construction rather than reducing to a fitted quantity by construction. The paper must demonstrate why randomized exploration alone suffices across the modeled scenarios.

    Authors: Section 4 proves the guarantee via a distribution-free argument: randomized exploration (with probability decaying as 1/t) ensures that the empirical missed-support rate is a martingale whose deviation from the target ε can be bounded by a Hoeffding-type inequality that holds for any fixed score function. The threshold is then adapted online to keep the rate below ε; the proof never relies on the score being correctly specified or on any particular data distribution, only on the ability to observe the missed-support outcome after each decision. This applies uniformly to the information-gathering, human-AI, and tool-use instantiations. We will insert a concise statement of this independence into the abstract.

    Revision: yes
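
For orientation, a Hoeffding-type bound for martingales of the kind this response invokes is the Azuma-Hoeffding inequality, given here in its generic textbook form. The symbols $D_t$, $c$, $\delta$ and $T$ are notation assumed for this sketch. The constants and slack term of the paper's Theorem 4.1 are not reproduced on this page and may differ.

```latex
% D_1, ..., D_T: martingale differences with |D_t| <= c almost surely
\Pr\!\left( \left| \frac{1}{T} \sum_{t=1}^{T} D_t \right| \ge \delta \right)
  \;\le\; 2 \exp\!\left( - \frac{T\,\delta^{2}}{2\,c^{2}} \right)
```

If the per-round error estimate were importance-weighted by an exploration probability µ, a bound $c$ on the order of $1/\mu$ would make the deviation shrink as µ grows, which would be consistent with the dependence on µ noted in the Figure 6 caption.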

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained standard optimization result


The paper defines a constrained optimization problem (minimize support usage subject to bounding counterfactual missed-support error) and states that its solution is a threshold rule on a scalar 'value of support.' This is a direct, non-circular consequence of the standard Lagrange-multiplier structure for such problems once the value is defined as the conditional improvement probability; the derivation does not reduce the claimed result to a fitted parameter or self-citation. The online algorithm is then constructed on top of that structure using randomized exploration, with the error-control guarantee following from the exploration mechanism rather than from re-fitting the same quantity. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results are present in the provided text. The framework is therefore self-contained against external benchmarks.
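
The threshold structure described above can be illustrated on a finite sample. The sketch below assumes each instance comes with a known value in [0, 1], read as the probability that support would help, and assumes the missed-support error of a policy is the average of those values over the instances where the agent acts alone. The function name `empirical_threshold` and both assumptions are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple


def empirical_threshold(values: Sequence[float], alpha: float) -> Tuple[float, float]:
    """Return (tau, support_rate) for the rule "call support iff value >= tau".

    Hypothetical sketch: instances are admitted to the act-alone set in
    increasing order of value until the error budget alpha * n would be
    exceeded. The first value that does not fit becomes the threshold.
    """
    n = len(values)
    budget = alpha * n  # total missed-support mass the constraint allows
    spent = 0.0
    tau = float("inf")  # stays infinite if every instance fits the budget
    for v in sorted(values):
        if spent + v > budget:
            tau = v
            break
        spent += v
    support_rate = sum(1 for v in values if v >= tau) / n
    return tau, support_rate


if __name__ == "__main__":
    tau, rate = empirical_threshold([0.05, 0.1, 0.2, 0.6, 0.9], alpha=0.08)
    print(tau, rate)
```

Admitting the smallest values first is the greedy counterpart of the Lagrangian argument: every act-alone decision saves the same one call, so the budget goes furthest on the instances that cost the least error.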

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on the existence of a quantifiable support value and the ability of randomized exploration to control the target error without distributional assumptions; these are domain assumptions stated in the abstract.

free parameters (1)
  • threshold on value of support
    The optimal policy is defined as a threshold on this value, which must be estimated or learned online.
axioms (2)
  • domain assumption: A scalar value of support exists that determines whether support materially improves the agent's output.
    This is required for the threshold rule and the definition of missed-support error.
  • domain assumption: Randomized exploration controls the missed-support error without distributional assumptions.
    This is the key property claimed for the online algorithm.

pith-pipeline@v0.9.1-grok · 5793 in / 1339 out tokens · 25164 ms · 2026-06-27T09:56:21.389906+00:00 · methodology

Reference graph

Works this paper leans on

97 extracted references · 6 canonical work pages

  1. [1]

    Semantically diverse language generation for uncertainty estimation in language models, 2024

    Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, and Sepp Hochreiter. Semantically diverse language generation for uncertainty estimation in language models, 2024. URLhttps://arxiv.org/ abs/2406.04306

  2. [2]

    Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D. Goodman. Star-gate: Teaching language models to ask clarifying questions, 2024. URLhttps://arxiv.org/abs/2403.19154

  3. [3]

    Angelopoulos, Emmanuel J

    Anastasios N. Angelopoulos, Emmanuel J. Candes, and Ryan J. Tibshirani. Conformal pid control for time series prediction, 2023. URLhttps://arxiv.org/abs/2307.16895

  4. [4]

    Towards human-ai complementarity in matching tasks, 2025

    Adrian Arnaiz-Rodriguez, Nina Corvelo Benz, Suhas Thejaswi, Nuria Oliver, and Manuel Gomez- Rodriguez. Towards human-ai complementarity in matching tasks, 2025. URLhttps://arxiv.org/ abs/2508.13285

  5. [5]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection, 2023

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection, 2023. URLhttps://arxiv.org/abs/2310.11511

  6. [6]

    On the utility of prediction sets in human-ai teams,

    Varun Babbar, Umang Bhatt, and Adrian Weller. On the utility of prediction sets in human-ai teams,

  7. [7]

    URLhttps://arxiv.org/abs/2205.01411

  8. [8]

    Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S. Weld. Is the most accurate ai the best teammate? optimizing ai for teamwork, 2021. URLhttps://arxiv.org/abs/2004.13102

  9. [9]

    Corvelo Benz and Manuel Gomez Rodriguez

    Nina L. Corvelo Benz and Manuel Gomez Rodriguez. Human-alignment influences the utility of ai-assisted decision making, 2025. URLhttps://arxiv.org/abs/2501.14035

  10. [10]

    A bandit model for human-machine decision making with private information and opacity

    Sebastian Bordt and Ulrike Von Luxburg. A bandit model for human-machine decision making with private information and opacity. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors,Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 ofProceedings of Machine Learning Research, pages 7300–...

  11. [11]

    The assistive multi- armed bandit, 2019

    Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, and Anca Dragan. The assistive multi- armed bandit, 2019. URLhttps://arxiv.org/abs/1901.08654

  12. [12]

    Sample efficient learning of predictors that complement humans, 2022

    Mohammad-Amin Charusaie, Hussein Mozannar, David Sontag, and Samira Samadi. Sample efficient learning of predictors that complement humans, 2022. URLhttps://arxiv.org/abs/2207.09584

  13. [13]

    Frugalgpt: How to use large language models while reducing cost and improving performance, 2023

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance, 2023. URLhttps://arxiv.org/abs/2305.05176. 13

  14. [14]

    Cherian, Isaac Gibbs, and Emmanuel J

    John J. Cherian, Isaac Gibbs, and Emmanuel J. Candès. Large language model validity via enhanced conformal prediction methods, 2024. URLhttps://arxiv.org/abs/2406.09714

  15. [15]

    Stevenson

    Bo Cowgill and Megan T. Stevenson. Algorithmic social engineering.AEA Papers and Proceedings, 110: 96–100, May 2020. doi: 10.1257/pandp.20201037. URLhttps://www.aeaweb.org/articles?id=10. 1257/pandp.20201037

  16. [16]

    Regression under human assistance, 2021

    Abir De, Nastaran Okati, Paramita Koley, Niloy Ganguly, and Manuel Gomez-Rodriguez. Regression under human assistance, 2021. URLhttps://arxiv.org/abs/1909.02963

  17. [17]

    Classification under human assistance, 2021

    Abir De, Nastaran Okati, Ali Zarezade, and Manuel Gomez-Rodriguez. Classification under human assistance, 2021. URLhttps://arxiv.org/abs/2006.11845

  18. [18]

    Prompting and evaluating large language models for proactive dialogues: Clarification, target-guided, and non- collaboration, 2023

    Yang Deng, Lizi Liao, Liang Chen, Hongru Wang, Wenqiang Lei, and Tat-Seng Chua. Prompting and evaluating large language models for proactive dialogues: Clarification, target-guided, and non- collaboration, 2023. URLhttps://arxiv.org/abs/2305.13626

  19. [19]

    When are two lists better than one?: Benefits and harms in joint decision-making, 2024

    Kate Donahue, Sreenivas Gollapudi, and Kostas Kollias. When are two lists better than one?: Benefits and harms in joint decision-making, 2024. URLhttps://arxiv.org/abs/2308.11721

  20. [20]

    Value of information: A framework for human-agent communication, 2026

    Yijiang River Dong, Tiancheng Hu, Zheng Hui, Caiqi Zhang, Ivan Vulić, Andreea Bobu, and Nigel Collier. Value of information: A framework for human-agent communication, 2026. URL https: //arxiv.org/abs/2601.06407

  21. [21]

    Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models, 2024

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models, 2024. URLhttps://arxiv.org/abs/2307.01379

  22. [22]

    Onthefoundationsofnoise-freeselectiveclassification.Journal of Machine Learning Research, 11(53):1605–1641, 2010

    RanEl-YanivandYairWiener. Onthefoundationsofnoise-freeselectiveclassification.Journal of Machine Learning Research, 11(53):1605–1641, 2010. URLhttp://jmlr.org/papers/v11/el-yaniv10a.html

  23. [24]

    Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

  24. [25]

    Human-centered human-ai collaboration (hchac), 2025

    Qi Gao, Wei Xu, Hanxi Pan, Mowei Shen, and Zaifeng Gao. Human-centered human-ai collaboration (hchac), 2025. URLhttps://arxiv.org/abs/2505.22477

  25. [26]

    Selective classification for deep neural networks, 2017

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks, 2017. URL https://arxiv.org/abs/1705.08500

  26. [27]

    Selectivenet: A deep neural network with an integrated reject option, 2019

    Yonatan Geifman and Ran El-Yaniv. Selectivenet: A deep neural network with an integrated reject option, 2019. URLhttps://arxiv.org/abs/1901.09192

  27. [28]

    Adaptive conformal inference under distribution shift, 2021

    Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift, 2021. URL https://arxiv.org/abs/2106.00170

  28. [29]

    Towards uncertainty-aware language agent, 2024

    Jiuzhou Han, Wray Buntine, and Ehsan Shareghi. Towards uncertainty-aware language agent, 2024. URLhttps://arxiv.org/abs/2401.14016

  29. [30]

    Learning to defer with limited expert predictions, 2023

    Patrick Hemmer, Lukas Thede, Michael Vössing, Johannes Jakubik, and Niklas Kühl. Learning to defer with limited expert predictions, 2023. URLhttps://arxiv.org/abs/2304.07306

  30. [31]

    Measuring mathematical problem solving with the math dataset, 2021

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874

  31. [32]

    Conformal prediction and human decision making, 2025

    Jessica Hullman, Yifan Wu, Dawei Xie, Ziyang Guo, and Andrew Gelman. Conformal prediction and human decision making, 2025. URLhttps://arxiv.org/abs/2503.11709. 14

  32. [33]

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity, 2024. URLhttps: //arxiv.org/abs/2403.14403

  33. [34]

    Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

    Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation, 2023. URLhttps://arxiv.org/ abs/2305.06983

  34. [35]

    Large language models must be taught to know what they don’t know, 2025

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew Gordon Wilson. Large language models must be taught to know what they don’t know, 2025. URLhttps://arxiv.org/abs/2406.08391

  35. [36]

    Towards unbiased and accurate deferral to multiple experts, 2021

    Vijay Keswani, Matthew Lease, and Krishnaram Kenthapadi. Towards unbiased and accurate deferral to multiple experts, 2021. URLhttps://arxiv.org/abs/2102.13004

  36. [37]

    When to trust the cheap check: Weak and strong verification for reasoning, 2026

    Shayan Kiyani, Sima Noorani, George Pappas, and Hamed Hassani. When to trust the cheap check: Weak and strong verification for reasoning, 2026. URLhttps://arxiv.org/abs/2602.17633

  37. [38]

    Conformal generative modeling with improved sample efficiency through sequential greedy filtering, 2025

    Klaus-Rudolf Kladny, Bernhard Schölkopf, and Michael Muehlebach. Conformal generative modeling with improved sample efficiency through sequential greedy filtering, 2025. URLhttps://arxiv.org/ abs/2410.01660

  38. [39]

    Algorithmic monoculture and social welfare.Proceedings of the National Academy of Sciences, 118(22), May 2021

    Jon Kleinberg and Manish Raghavan. Algorithmic monoculture and social welfare.Proceedings of the National Academy of Sciences, 118(22), May 2021. ISSN 1091-6490. doi: 10.1073/pnas.2018340118. URL http://dx.doi.org/10.1073/pnas.2018340118

  39. [40]

    Clam: Selective clarification for ambiguous questions with generative language models, 2023

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Clam: Selective clarification for ambiguous questions with generative language models, 2023. URLhttps://arxiv.org/abs/2212.07769

  40. [41]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023. URL https://arxiv.org/abs/2302. 09664

  41. [42]

    Conformal prediction with large language models for multi-choice question answering, 2023

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering, 2023. URL https://arxiv.org/abs/2305.18404

  42. [43]

    Li, Alex Tamkin, Noah Goodman, and Jacob Andreas

    Belinda Z. Li, Alex Tamkin, Noah Goodman, and Jacob Andreas. Eliciting human preferences with language models, 2023. URLhttps://arxiv.org/abs/2310.11589

  43. [44]

    Conftuner: Training large language models to express their confidence verbally, 2025

    Yibo Li, Miao Xiong, Jiaying Wu, and Bryan Hooi. Conftuner: Training large language models to express their confidence verbally, 2025. URLhttps://arxiv.org/abs/2508.18847

  44. [45]

    Uncertainty estimation and quantification for llms: A simple supervised approach, 2024

    Linyu Liu, Yu Pan, Xiaocheng Li, and Guanting Chen. Uncertainty estimation and quantification for llms: A simple supervised approach, 2024. URLhttps://arxiv.org/abs/2404.15993

  45. [46]

    Multi-group uncertainty quantification for long-form text generation,

    Terrance Liu and Zhiwei Steven Wu. Multi-group uncertainty quantification for long-form text generation,

  46. [47]

    URLhttps://arxiv.org/abs/2407.21057

  47. [48]

    Predict responsibly: Improving fairness and accuracy by learning to defer, 2018

    David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer, 2018. URLhttps://arxiv.org/abs/1711.06664

  48. [49]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucina- tion detection for generative large language models, 2023. URLhttps://arxiv.org/abs/2303.08896

  49. [50]

    Two-stage learning to defer with multiple experts

    Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, ed- itors,Advances in Neural Information Processing Systems, volume 36, pages 3578–3606. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/...

  50. [51]

    Language models with conformal factuality guarantees,

    Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees,

  51. [52]

    URLhttps://arxiv.org/abs/2402.10978

  52. [53]

    Optimal query allocation in extractive qa with llms: A learning-to-defer framework with theoretical guarantees, 2025

    Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. Optimal query allocation in extractive qa with llms: A learning-to-defer framework with theoretical guarantees, 2025. URLhttps://arxiv.org/abs/2410.15761

  53. [54]

    Consistent estimators for learning to defer to an expert, 2021

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert, 2021. URLhttps://arxiv.org/abs/2006.01862

  54. [55]

    Human-ai collaborative uncertainty quantification.arXiv preprint arXiv:2510.23476, 2025

    Sima Noorani, Shayan Kiyani, George Pappas, and Hamed Hassani. Human-ai collaborative uncertainty quantification.arXiv preprint arXiv:2510.23476, 2025

  55. [56]

    Multi-round human-ai collaboration with user-specified requirements.arXiv preprint arXiv:2602.17646, 2026

    Sima Noorani, Shayan Kiyani, Hamed Hassani, and George Pappas. Multi-round human-ai collaboration with user-specified requirements.arXiv preprint arXiv:2602.17646, 2026

  56. [57]

    Differentiable learning under triage, 2021

    Nastaran Okati, Abir De, and Manuel Gomez-Rodriguez. Differentiable learning under triage, 2021. URLhttps://arxiv.org/abs/2103.08902

  57. [58]

    Gonzalez, M Waleed Kadous, and Ion Stoica

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data, 2025. URLhttps: //arxiv.org/abs/2406.18665

  58. [59]

    Gpt-4o system card, 2024

    OpenAI. Gpt-4o system card, 2024. URLhttps://arxiv.org/abs/2410.21276

  59. [60]

    Conformal arbitrage: Risk-controlled balancing of competing objectives in language models, 2025

    William Overman and Mohsen Bayati. Conformal arbitrage: Risk-controlled balancing of competing objectives in language models, 2025. URLhttps://arxiv.org/abs/2506.00911

  60. [61]

    Calibrate-then-delegate: Safety monitoring with risk and budget guarantees via model cascades

    Edoardo Pona, Milad Kazemi, Mehran Hosseini, Yali Du, David Watson, Osvaldo Simeone, and Nicola Paoletti. Calibrate-then-delegate: Safety monitoring with risk and budget guarantees via model cascades. URL https://arxiv.org/abs/2604.14251

  62. [63]

    Virtualhome: Simulating household activities via programs, 2018

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs, 2018. URL https://arxiv.org/abs/1806.07011

  63. [64]

    Learning paradigms for hybrid decision-making systems. ACM Comput. Surv., 2026

    Clara Punzi, Roberto Pellungrini, Mattia Setzu, Fosca Giannotti, and Dino Pedreschi. Learning paradigms for hybrid decision-making systems. ACM Comput. Surv., April 2026. ISSN 0360-0300. doi: 10.1145/3802522. URL https://doi.org/10.1145/3802522. Just Accepted

  64. [65]

    Scent of knowledge: Optimizing search-enhanced reasoning with information foraging

    Hongjin Qian and Zheng Liu. Scent of knowledge: Optimizing search-enhanced reasoning with information foraging. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=26kUrQm4zw

  65. [66]

    Conformal language modeling, 2024

    Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 2024. URL https://arxiv.org/abs/2306.10193

  66. [67]

    Qwen2.5 technical report, 2025

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  67. [68]

    The algorithmic automation problem: Prediction, triage, and human effort, 2019

    Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort, 2019. URL https://arxiv.org/abs/1903.12220

  68. [69]

    The relationship between no-regret learning and online conformal prediction. arXiv preprint arXiv:2502.10947, 2025

    Ramya Ramalingam, Shayan Kiyani, and Aaron Roth. The relationship between no-regret learning and online conformal prediction. arXiv preprint arXiv:2502.10947, 2025

  69. [70]

    A taxonomy of human and ml strengths in decision-making to investigate human-ml complementarity, 2023

    Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari. A taxonomy of human and ml strengths in decision-making to investigate human-ml complementarity, 2023. URL https://arxiv.org/abs/2204.10806

  70. [71]

    Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners, 2023. URL https://arxiv.org/abs/2307.01928

  71. [72]

    When2call: When (not) to call tools

    Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools. URL https://arxiv.org/abs/2504.18851

  73. [74]

    Conformal language model reasoning with coherent factuality

    Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, and Surbhi Goel. Conformal language model reasoning with coherent factuality. In The Thirteenth International Conference on Learning Representations

  74. [75]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. URL https://arxiv.org/abs/2302.04761

  76. [77]

    Conformal prediction sets for deep generative models via reduction to conformal regression. arXiv preprint arXiv:2503.10512, 2025

    Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, and Janardhan Rao Doppa. Conformal prediction sets for deep generative models via reduction to conformal regression. arXiv preprint arXiv:2503.10512, 2025

  77. [78]

    Bayesian modeling of human ai complementarity. Proceedings of the National Academy of Sciences, 119(11):e2111547119, 2022

    Mark Steyvers, Heliodoro Tejeda, Gavin Kerrigan, and Padhraic Smyth. Bayesian modeling of human ai complementarity. Proceedings of the National Academy of Sciences, 119(11):e2111547119, 2022. doi: 10.1073/pnas.2111547119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2111547119

  78. [79]

    Improving expert predictions with conformal prediction, 2023

    Eleni Straitouri, Lequn Wang, Nastaran Okati, and Manuel Gomez Rodriguez. Improving expert predictions with conformal prediction, 2023. URL https://arxiv.org/abs/2201.12006

  79. [80]

    Controlling counterfactual harm in decision support systems based on prediction sets, 2024

    Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. Controlling counterfactual harm in decision support systems based on prediction sets, 2024. URL https://arxiv.org/abs/2406.06671

  80. [81]

    Api is enough: Conformal prediction for large language models without logit-access, 2024

    Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. Api is enough: Conformal prediction for large language models without logit-access, 2024. URL https://arxiv.org/abs/2403.01216
