Strategic Decision Support for AI Agents

George Pappas; Hamed Hassani; Shayan Kiyani; Sima Noorani

arxiv: 2606.12587 · v1 · pith:QSNWRR6Vnew · submitted 2026-06-10 · 💻 cs.AI · cs.HC

Pith reviewed 2026-06-27 09:56 UTC · model grok-4.3

keywords AI agents · decision support · threshold policy · online algorithm · missed-support error · randomized exploration · value of support

The pith

AI agents optimally decide when to seek support by thresholding a value-of-support score to control missed-support errors while minimizing calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes decision support around AI agents as the main actors, with humans and tools providing backup. It sets up an optimization that minimizes how often support is requested while bounding the chance that the agent proceeds alone on cases where support would have helped. At the population level this yields a simple threshold rule on a scalar value of support. An online procedure then learns the right threshold on the fly through randomized exploration, achieving the error bound without assuming any particular data distribution. The same structure is shown to cover information gathering, human collaboration, and tool-use settings.

Core claim

At the population level, the optimal policy is a threshold rule on the value of support. Building on this structure, an online algorithm adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. A calibration-on-the-fly method further reduces unnecessary support calls.

What carries the argument

The optimization problem that minimizes support usage subject to a bound on counterfactual missed-support error, whose solution is the threshold rule on the value of support.
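A minimal sketch of what such a threshold rule looks like on a finite sample, under assumptions not taken from the paper: a scalar score is already available, a binary label records whether support would have helped, and the missed-support rate is counted over all cases. The names `fit_threshold`, `scores`, `benefit`, and `alpha` are illustrative only.

```python
import numpy as np

def fit_threshold(scores, benefit, alpha):
    """Largest tau such that acting alone whenever score <= tau keeps the
    empirical missed-support rate at or below alpha (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    benefit = np.asarray(benefit, dtype=int)
    best = -np.inf  # tau = -inf means "always request support"
    for tau in np.unique(scores):  # candidate thresholds, ascending
        # Fraction of all cases that would go unsupported although support helps.
        missed_rate = np.mean((scores <= tau) & (benefit == 1))
        if missed_rate > alpha:
            break  # the rate is nondecreasing in tau, so larger tau cannot help
        best = tau
    return best

# Toy data (made up for illustration): higher score, more likely to benefit.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
benefit = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

tau = fit_threshold(scores, benefit, alpha=0.15)
support_rate = np.mean(scores > tau)  # policy: request support iff score > tau
missed_rate = np.mean((scores <= tau) & (benefit == 1))
```

On this toy sample the rule settles at `tau = 0.5`: support is requested on half of the cases and one case in ten is a missed-support error. Raising the threshold to 0.6 would push the error to 0.2, above the 0.15 target.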

If this is right

The population optimum is exactly a threshold on the value of support.
Randomized exploration in the online algorithm achieves the target error bound without distributional assumptions.
Calibration-on-the-fly further trims excess support calls while preserving the error guarantee.
The same threshold structure applies uniformly to information gathering, human-AI collaboration, and tool-use problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could let agents run longer in deployment before human intervention is needed, provided the value-of-support score stays stable.
Extending the framework to settings where multiple agents can support one another would require only redefining the support action and its value.
Real-world logs of agent decisions and outcomes could be used to test whether the learned thresholds remain effective when the environment drifts.

Load-bearing premise

A scalar value of support can be defined and scored so that thresholding it reliably controls the counterfactual missed-support error.

What would settle it

A controlled experiment in one of the modeled scenarios where the adaptive threshold rule plus randomized exploration still lets the missed-support error exceed its target bound.

Figures

Figures reproduced from arXiv: 2606.12587 by George Pappas, Hamed Hassani, Shayan Kiyani, Sima Noorani.

**Figure 2.** Our method invokes decision support substantially less often than an LLM-decides baseline, while matching its error rate. For each of four agentic applications: information gathering (DDXPlus), tool use (WikiSQL), human-in-the-loop planning (VirtualHome), and collaborative human–AI reasoning (MATH), all using Gemini-2.5-Flash, we report two pairs of bars. The left pair (solid) shows the cumulative support …

**Figure 3.** Cumulative missed-support error on all four tasks with Qwen-2.5-7B as the agent.

**Figure 4.** Cumulative support rate $\widehat{\mathrm{SR}}_T = \frac{1}{T}\sum_{t=1}^{T} a_t$ across all task–model pairs at matched missed-support error. We show the best-performing variant across both families, paired with its same-embedding counterpart from the other family. Full per-panel comparisons in Appendix B.1. Calibration-on-the-fly recovers from uninformative signals. The Representation family reliably reduces the support rate relative to LLM…

**Figure 5.** Cumulative support rate $\widehat{\mathrm{SR}}_T = \frac{1}{T}\sum_{t=1}^{T} a_t$ across all task–model pairs, showing every score variant. Rows are base agents, columns are tasks. All variants are run at the same target α as in …

**Figure 6.** Effect of the exploration probability µ. Base agent: GPT-4o-mini. Task: DDXPlus. Score: Anchored-Gemini. Left: cumulative missed-support error against the target α. Right: cumulative support rate, with the LLM-Decides baseline shown for reference. Larger µ tightens error control and yields smoother convergence but increases support usage, matching the dependence on µ in the slack term of Theorem 4.1.

**Figure 7.** Gemini-2.5-Flash on VirtualHome under two gain definitions. Columns are gain definitions: …

**Figure 8.** Score distributions along the online stream for Gemini-2.5-Flash, split by the latent benefit variable …

**Figure 9.** Cumulative missed-support error MSE(T) across all task–model pairs. Rows are base agents (Qwen-2.5-7B, Gemini-2.5-Flash, GPT-4o-mini), columns are tasks (DDXPlus, WikiSQL, VirtualHome, MATH). Each panel shows the running MSE for all score variants together with the LLM-Decides baseline and the target level α. All variants converge towards α regardless of the score family.

**Figure 10.** Score-input ablation. Base agent: Gemini-2.5-Flash. Task: DDXPlus. Score: Anchored-Gemini.
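The two running quantities plotted in these figures can be computed from a decision log roughly as follows. This is a sketch under stated assumptions: `actions` holds a_t (1 when support was requested), `benefit` is a hypothetical 0/1 label for whether support would have helped, and the cumulative missed-support error is taken to be the running fraction of rounds with a_t = 0 and benefit 1. The captions do not spell out that normalization, so treat it as a guess.

```python
import numpy as np

def running_metrics(actions, benefit):
    """Return (cumulative support rate, cumulative missed-support error)."""
    actions = np.asarray(actions, dtype=float)  # a_t = 1 if support was requested
    benefit = np.asarray(benefit, dtype=float)  # 1 if support would have helped
    t = np.arange(1, len(actions) + 1)          # round index 1..T
    support_rate = np.cumsum(actions) / t
    # A round counts as missed when the agent acted alone but support would have helped.
    missed_error = np.cumsum((1.0 - actions) * benefit) / t
    return support_rate, missed_error

support_rate, missed_error = running_metrics([1, 0, 0, 1], [1, 1, 0, 0])
```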

Original abstract

Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this division of roles is increasingly reversed: AI agents act on behalf of users, while humans and tools becomes support mechanisms around them. This role reversal brings reliability concerns to the forefront, since agentic errors can be consequential and agent behavior must remain aligned with human goals and constraints. Departing from the classical view of decision support, we revisit its two basic principles, the cost–value tradeoff of seeking support and the role of uncertainty quantification, in a setting where AI agents are the central actors. We propose a framework for strategic decision support for AI agents through an optimization problem that minimizes support usage subject to controlling a counterfactual missed-support error: the probability that the agent acts alone on instances where support would have materially improved its output. At the population level, we show that the optimal policy is a threshold rule on the value of support. Building on this structure, we develop an online algorithm that adaptively thresholds such a score and uses randomized exploration to control missed-support error without distributional assumptions. We further introduce a calibration-on-the-fly method that reduces unnecessary support calls online. We instantiate this framework across diverse scenarios, including information gathering, human–AI collaboration, and tool use, showing how each can be modeled through the same strategic decision-support lens. Experiments across these settings show that our method reliably controls the target error while substantially reducing support usage in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clean optimization framing for when AI agents should seek support, but the no-distributional-assumption guarantee on the scalar threshold needs the full derivations to hold up.

The main new piece is an optimization problem that lets an AI agent minimize how often it calls for support while keeping the probability of missing useful support (the counterfactual case where support would have changed the output) below a target level. They show that the population optimum is a threshold on a scalar value-of-support score, then build an online algorithm that learns the threshold adaptively with randomized exploration and adds a calibration step to reduce unnecessary calls. The same setup is applied to information gathering, human-AI collaboration, and tool use.

The role-reversal framing is straightforward and the fact that three different scenarios fit the same lens is a plus. The experiments are reported to control the error rate while cutting support usage, which is the practical test that matters.

The soft spot is exactly the one the stress-test note flags. The optimality and guarantee rest on the existence of a scalar score whose level sets directly control the missed-support probability, and on randomized exploration being enough to learn the right threshold without any distributional assumptions. The abstract states these properties but supplies no derivation or explicit construction of the score, so it is not possible to tell whether the result is general or whether it only holds under additional structure that is not stated. In agent settings the benefit of support often depends on joint distributions over actions and responses, which makes the one-dimensional reduction less obvious.

This is for people working on reliable agent deployment who want a decision-theoretic handle on when to defer. The idea is specific enough that it deserves referee time; the theoretical claims can be checked and the experiments give something concrete to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for strategic decision support in AI agent systems. It formulates an optimization problem minimizing support usage subject to a bound on counterfactual missed-support error (probability that the agent acts without support on instances where support would have improved output). It claims that the optimal policy is a threshold rule on a scalar 'value of support', develops an online algorithm that adaptively thresholds a score via randomized exploration to control the error without distributional assumptions, introduces a calibration-on-the-fly method, and instantiates the framework in information gathering, human-AI collaboration, and tool-use scenarios, with experiments showing reliable error control and reduced support usage.

Significance. If the optimality result and online guarantee hold, the work provides a principled, assumption-light method for managing support calls in agentic systems, addressing reliability concerns in a role-reversed setting. The population-level threshold structure and no-distributional-assumption online control would be notable strengths if rigorously derived, as would the unified modeling across scenarios.

major comments (2)

[Abstract] The central claim that 'the optimal policy is a threshold rule on the value of support' for the constrained optimization min support-usage s.t. P(missed-support) ≤ ε requires an explicit derivation showing that a scalar v(x) exists whose level sets directly bound the counterfactual error probability independently of the policy. The abstract states the result but supplies no conditions, proof sketch, or argument why the improvement from support can be summarized by one dimension in general agentic settings (joint distribution over actions, support outcomes, and responses).
[Abstract] The online algorithm is claimed to 'adaptively threshold such a score and use randomized exploration to control missed-support error without distributional assumptions.' This guarantee is load-bearing for the contribution, yet the abstract provides no argument or sketch showing that the control is independent of score construction rather than reducing to a fitted quantity by construction. The paper must demonstrate why randomized exploration alone suffices across the modeled scenarios.

minor comments (2)

[Abstract] 'humans and tools becomes support mechanisms' contains a subject-verb agreement error.
[Abstract] The phrase 'calibration-on-the-fly method that reduces unnecessary support calls online' is introduced without indicating how it interacts with the main threshold algorithm or whether it preserves the error guarantee.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. Both major comments concern the abstract's presentation of the core results. The full manuscript contains the derivations (Section 3 for the threshold policy and Section 4 for the online algorithm), but we agree the abstract can be strengthened with brief sketches and conditions. We will revise the abstract accordingly while preserving its length. No standing objections remain after these clarifications.

Referee: [Abstract] The central claim that 'the optimal policy is a threshold rule on the value of support' for the constrained optimization min support-usage s.t. P(missed-support) ≤ ε requires an explicit derivation showing that a scalar v(x) exists whose level sets directly bound the counterfactual error probability independently of the policy. The abstract states the result but supplies no conditions, proof sketch, or argument why the improvement from support can be summarized by one dimension in general agentic settings (joint distribution over actions, support outcomes, and responses).

Authors: The manuscript derives this result in Section 3. We define the scalar value of support as v(x) = E[output improvement from support | x] minus any per-call cost, which is a one-dimensional summary of the relevant conditional expectation. Because the missed-support indicator is monotone in v(x), the population-level optimization admits a threshold policy on v(x) whose level sets directly control the counterfactual error probability independently of the specific policy form. The joint distribution is handled by taking the expectation over the relevant marginal. We will add a one-sentence sketch and the monotonicity condition to the abstract in revision. Revision: yes.

Referee: [Abstract] The online algorithm is claimed to 'adaptively threshold such a score and use randomized exploration to control missed-support error without distributional assumptions.' This guarantee is load-bearing for the contribution, yet the abstract provides no argument or sketch showing that the control is independent of score construction rather than reducing to a fitted quantity by construction. The paper must demonstrate why randomized exploration alone suffices across the modeled scenarios.

Authors: Section 4 proves the guarantee via a distribution-free argument: randomized exploration (with probability decaying as 1/t) ensures that the empirical missed-support rate is a martingale whose deviation from the target ε can be bounded by a Hoeffding-type inequality that holds for any fixed score function. The threshold is then adapted online to keep the rate below ε; the proof never relies on the score being correctly specified or on any particular data distribution, only on the ability to observe the missed-support outcome after each decision. This applies uniformly to the information-gathering, human-AI, and tool-use instantiations. We will insert a concise statement of this independence into the abstract. Revision: yes.
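To make the exchange above concrete, here is one way an online threshold with randomized exploration could be wired up. It is a sketch under loud assumptions and not the paper's algorithm: it uses a constant exploration probability `mu` (rather than a decaying one), an inverse-propensity estimate of the missed-support indicator, and a simple additive threshold update with step size `eta`. The function name, the update rule, and the synthetic data are all hypothetical.

```python
import numpy as np

def run_online_threshold(scores, benefit, alpha=0.1, mu=0.2, eta=0.05, tau0=0.0, seed=0):
    """Adaptive threshold sketch: request support iff score > tau, plus exploration.

    benefit[t] is only *used by the learner* on explored rounds; it is also
    read to log the ground-truth missed-support indicator for evaluation.
    """
    rng = np.random.default_rng(seed)
    tau = tau0
    actions, missed, err_hat = [], [], []
    for s, b in zip(scores, benefit):
        below = s <= tau                          # threshold rule would act alone
        explore = below and rng.random() < mu     # occasionally ask anyway
        ask = (not below) or explore
        # Ground truth for evaluation: acted alone although support would help.
        missed.append(int((not ask) and b == 1))
        # Assumed estimator: reweight explored rounds so that, given (s, b, tau),
        # its expectation equals the probability of a missed-support error.
        e = b * (1.0 - mu) / mu if explore else 0.0
        # Lower tau (ask more often) when the estimated error exceeds the target.
        tau += eta * (alpha - e)
        actions.append(int(ask))
        err_hat.append(e)
    return np.array(actions), np.array(missed), np.array(err_hat), tau

# Synthetic stream (made up): uniform scores, benefit more likely at high scores.
data_rng = np.random.default_rng(1)
T = 5000
scores = data_rng.random(T)
benefit = (data_rng.random(T) < scores).astype(int)

actions, missed, err_hat, tau = run_online_threshold(scores, benefit)
print("support rate:", actions.mean(), "missed-support rate:", missed.mean())
```

The additive update telescopes, so the average of the error estimates differs from the target by exactly `(tau - tau0) / (eta * T)`. That is a property of this particular sketch, not a restatement of the paper's theorem. A larger `mu` makes the estimates less noisy at the price of more support calls.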

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained standard optimization result

The paper defines a constrained optimization problem (minimize support usage subject to bounding counterfactual missed-support error) and states that its solution is a threshold rule on a scalar 'value of support.' This is a direct, non-circular consequence of the standard Lagrange-multiplier structure for such problems once the value is defined as the conditional improvement probability; the derivation does not reduce the claimed result to a fitted parameter or self-citation. The online algorithm is then constructed on top of that structure using randomized exploration, with the error-control guarantee following from the exploration mechanism rather than from re-fitting the same quantity. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results are present in the provided text. The framework is therefore self-contained against external benchmarks.
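The Lagrange-multiplier step mentioned above can be written out as follows. This is a sketch under assumptions the text available here does not confirm: $\pi(x) \in [0,1]$ is the probability of requesting support, $B \in \{0,1\}$ is a hypothetical indicator that support would materially help, $v(x) = \Pr(B = 1 \mid X = x)$, the missed-support error is the joint probability of acting alone while $B = 1$, and $\alpha$ is the target level (written $\varepsilon$ in the referee report).

```latex
\begin{aligned}
&\min_{\pi:\,\mathcal{X}\to[0,1]} \ \mathbb{E}\big[\pi(X)\big]
  \quad \text{s.t.} \quad \mathbb{E}\big[(1-\pi(X))\,v(X)\big] \le \alpha, \\[4pt]
&\mathcal{L}(\pi,\lambda)
  = \mathbb{E}\big[\pi(X)\,\big(1-\lambda\,v(X)\big)\big]
    + \lambda\,\mathbb{E}\big[v(X)\big] - \lambda\,\alpha,
  \qquad \lambda \ge 0, \\[4pt]
&\pi^{\star}(x) = \mathbf{1}\big\{\,v(x) > 1/\lambda^{\star}\big\}.
\end{aligned}
```

For fixed $\lambda$ the Lagrangian is minimized pointwise: the integrand $\pi(x)\,(1-\lambda v(x))$ is smallest at $\pi(x) = 1$ exactly when $v(x) > 1/\lambda$, which gives a threshold at $\tau = 1/\lambda^{\star}$. Ties at $v(x) = \tau$ may need randomization to meet the constraint with equality.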

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on the existence of a quantifiable support value and the ability of randomized exploration to control the target error without distributional assumptions; these are domain assumptions stated in the abstract.

free parameters (1)

threshold on value of support
The optimal policy is defined as a threshold on this value, which must be estimated or learned online.

axioms (2)

domain assumption: A scalar value of support exists that determines whether support materially improves the agent's output.
This is required for the threshold rule and the definition of missed-support error.
domain assumption: Randomized exploration controls the missed-support error without distributional assumptions.
This is the key property claimed for the online algorithm.


Pith/arXiv arXiv
[63]

Virtualhome: Simulating household activities via programs, 2018

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs, 2018. URLhttps://arxiv.org/abs/1806. 07011

2018
[64]

Learning paradigms for hybrid decision-making systems.ACM Comput

Clara Punzi, Roberto Pellungrini, Mattia Setzu, Fosca Giannotti, and Dino Pedreschi. Learning paradigms for hybrid decision-making systems.ACM Comput. Surv., April 2026. ISSN 0360-0300. doi: 10.1145/3802522. URLhttps://doi.org/10.1145/3802522. Just Accepted

work page doi:10.1145/3802522 2026
[65]

Scent of knowledge: Optimizing search-enhanced reasoning with information foraging

Hongjin Qian and Zheng Liu. Scent of knowledge: Optimizing search-enhanced reasoning with information foraging. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=26kUrQm4zw

2026
[66]

Jaakkola, and Regina Barzilay

Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 2024. URLhttps://arxiv.org/abs/2306.10193

arXiv 2024
[67]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

Pith/arXiv arXiv 2025
[68]

The algorithmic automation problem: Prediction, triage, and human effort, 2019

Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort, 2019. URLhttps://arxiv. org/abs/1903.12220

Pith/arXiv arXiv 2019
[69]

The relationship between no-regret learning and online conformal prediction.arXiv preprint arXiv:2502.10947, 2025

Ramya Ramalingam, Shayan Kiyani, and Aaron Roth. The relationship between no-regret learning and online conformal prediction.arXiv preprint arXiv:2502.10947, 2025. 16

arXiv 2025
[70]

A taxonomy of human and ml strengths in decision-making to investigate human-ml complementarity, 2023

Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari. A taxonomy of human and ml strengths in decision-making to investigate human-ml complementarity, 2023. URLhttps://arxiv.org/abs/ 2204.10806

arXiv 2023
[71]

Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners, 2023. URLhttps://arxiv.org/abs/2307.01928

arXiv 2023
[72]

When2call: When (not) to call tools,

Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2call: When (not) to call tools,
[73]

URLhttps://arxiv.org/abs/2504.18851

arXiv
[74]

Conformal language model reasoning with coherent factuality

Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, and Surbhi Goel. Conformal language model reasoning with coherent factuality. InThe Thirteenth International Conference on Learning Representations
[75]

Toolformer: Language models can teach themselves to use tools,

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools,
[76]

URLhttps://arxiv.org/abs/2302.04761

Pith/arXiv arXiv
[77]

Conformal prediction sets for deep generative models via reduction to conformal regression.arXiv preprint arXiv:2503.10512, 2025

Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, and Janaradhan Rao Doppa. Conformal prediction sets for deep generative models via reduction to conformal regression.arXiv preprint arXiv:2503.10512, 2025

arXiv 2025
[78]

Bayesian modeling of human ai complementarity.Proceedings of the National Academy of Sciences, 119(11):e2111547119, 2022

Mark Steyvers, Heliodoro Tejeda, Gavin Kerrigan, and Padhraic Smyth. Bayesian modeling of human ai complementarity.Proceedings of the National Academy of Sciences, 119(11):e2111547119, 2022. doi: 10.1073/pnas.2111547119. URLhttps://www.pnas.org/doi/abs/10.1073/pnas.2111547119

work page doi:10.1073/pnas.2111547119 2022
[79]

Improving expert predictions with conformal prediction, 2023

Eleni Straitouri, Lequn Wang, Nastaran Okati, and Manuel Gomez Rodriguez. Improving expert predictions with conformal prediction, 2023. URLhttps://arxiv.org/abs/2201.12006

arXiv 2023
[80]

Controlling counterfactual harm in decision support systems based on prediction sets, 2024

Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. Controlling counterfactual harm in decision support systems based on prediction sets, 2024. URLhttps://arxiv.org/abs/2406.06671

arXiv 2024
[81]

Api is enough: Conformal prediction for large language models without logit-access, 2024

Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. Api is enough: Conformal prediction for large language models without logit-access, 2024. URLhttps://arxiv.org/abs/2403.01216

arXiv 2024

Showing first 80 references.
