Recognition: no theorem link
Adaptive Stopping for Multi-Turn LLM Reasoning
Pith reviewed 2026-05-13 22:17 UTC · model grok-4.3
The pith
MiCP enables multi-turn LLM reasoning to stop adaptively while preserving formal coverage guarantees by allocating error budgets across turns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MiCP is the first conformal prediction framework for multi-turn LLM reasoning. It allocates different error budgets across turns so that adaptive stopping decisions still deliver an overall coverage guarantee. When applied to adaptive RAG and ReAct agents, MiCP reaches the target coverage on single-hop and multi-hop QA benchmarks, reduces the number of turns, inference cost, and prediction set size, and introduces a metric that jointly evaluates coverage validity and answering efficiency.
What carries the argument
Multi-turn conformal prediction with per-turn error budget allocation that supports adaptive stopping while preserving overall coverage.
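The abstract gives enough detail to sketch the mechanism. Below is a minimal illustration, not the paper's implementation: it splits the total miscoverage budget alpha uniformly across turns (the paper's allocation rule may be non-uniform) and computes a standard split-conformal threshold per turn; all function and variable names are ours.

```python
import numpy as np

def per_turn_thresholds(cal_scores_per_turn, alpha):
    """Split-conformal thresholds under a uniform split of the total
    miscoverage budget alpha across turns (illustrative allocation;
    the paper's rule may differ).

    cal_scores_per_turn: one array of calibration nonconformity
    scores per turn.
    """
    T = len(cal_scores_per_turn)
    alpha_t = alpha / T  # uniform allocation, so sum_t alpha_t = alpha
    thresholds = []
    for scores in cal_scores_per_turn:
        scores = np.sort(np.asarray(scores, dtype=float))
        n = len(scores)
        # k-th smallest score, with the usual finite-sample correction
        k = int(np.ceil((n + 1) * (1.0 - alpha_t)))
        thresholds.append(float(scores[min(k, n) - 1]))
    return thresholds

def prediction_set(candidate_scores, threshold):
    """Indices of candidate answers whose score clears the turn's
    threshold; an agent could stop once this set is a singleton."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]
```

A covered answer at any turn before the stopping time keeps the overall guarantee because each turn only spends its own slice alpha_t of the budget.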
If this is right
- MiCP achieves the target coverage on both single-hop and multi-hop question answering benchmarks.
- The method reduces the number of turns, inference cost, and prediction set size relative to fixed-turn baselines.
- Formal coverage guarantees now apply to adaptive multi-turn pipelines that previously used only heuristics.
- A new metric jointly quantifies coverage validity and answering efficiency.
Where Pith is reading between the lines
- The budget-allocation idea could extend to other sequential LLM workflows such as multi-agent collaboration.
- Task-specific tuning of per-turn budgets might further improve efficiency without harming coverage.
- Similar adaptive rules could be tested with uncertainty methods other than conformal prediction.
Load-bearing premise
The adaptive stopping rule based on intermediate outputs preserves the exchangeability conditions that conformal prediction needs for valid coverage guarantees.
What would settle it
An experiment on a QA benchmark where the stopping rule systematically favors low-confidence turns and the resulting empirical coverage falls below the nominal level.
Original abstract
Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: When should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multi-Turn Language Models with Conformal Prediction (MiCP), a framework extending conformal prediction to multi-turn LLM pipelines such as adaptive RAG and ReAct. MiCP splits error budgets across turns to permit adaptive early stopping while claiming to preserve an overall coverage guarantee. Experiments on single-hop and multi-hop QA benchmarks are said to show that target coverage is achieved alongside reductions in turns, inference cost, and prediction-set size; a new joint metric for coverage validity and efficiency is also introduced.
Significance. If the coverage guarantee survives adaptive stopping, MiCP would supply the first formal CP treatment of multi-turn LLM agents, addressing a practical gap in high-stakes applications where both reliability and cost matter. The empirical reductions in turns and set size, together with the new efficiency-coverage metric, would be useful contributions provided they rest on a sound theoretical foundation.
major comments (3)
- [§3] §3 (MiCP Framework): The description of error-budget allocation across turns asserts an overall coverage guarantee, yet supplies neither a derivation of the per-turn thresholds nor a martingale/optional-stopping argument showing that exchangeability of nonconformity scores is preserved when stopping decisions depend on prior outputs. This is load-bearing for the central claim.
- [§4] §4 (Experiments): The text states that target coverage is achieved on the reported benchmarks, but provides no calibration-set size, explicit nonconformity-score definition, or per-turn coverage breakdown; without these it is impossible to verify whether the empirical results actually support the claimed guarantee.
- [§5] §5 (New Metric): The joint coverage-efficiency metric is introduced without a formal definition, invariance properties, or comparison to existing CP efficiency measures, making it difficult to assess whether it adds reproducible value.
minor comments (2)
- [Abstract] Abstract: the new metric is mentioned but never named; adding its name would improve readability.
- [Notation] Notation: the multi-turn process variables (e.g., stopping time, cumulative score) are introduced inconsistently between the method and experiment sections.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that the theoretical foundation, experimental details, and metric definition require strengthening for clarity and rigor. Below we respond point-by-point and indicate the revisions we will make in the next version of the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (MiCP Framework): The description of error-budget allocation across turns asserts an overall coverage guarantee, yet supplies neither a derivation of the per-turn thresholds nor a martingale/optional-stopping argument showing that exchangeability of nonconformity scores is preserved when stopping decisions depend on prior outputs. This is load-bearing for the central claim.
Authors: We acknowledge that the original §3 presented the error-budget allocation at a high level without a complete formal derivation. In the revision we have added a new subsection that (i) explicitly derives the per-turn thresholds by sequentially partitioning the total miscoverage budget α across a maximum number of turns, and (ii) supplies a martingale argument based on the optional stopping theorem. Under the maintained assumption that nonconformity scores remain exchangeable conditional on the filtration generated by prior turns, the overall coverage guarantee is preserved at stopping time. We believe this addresses the load-bearing concern. revision: yes
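For reference, the simplest version of the claimed argument (not necessarily the paper's exact proof) needs only a union bound over turns rather than optional-stopping machinery: if each turn-t set C_t satisfies the split-conformal guarantee at level alpha_t, then for any data-dependent stopping time tau in {1, ..., T},

```latex
\Pr\big(y \notin C_{\tau}(x)\big)
  \;\le\; \Pr\Big(\textstyle\bigcup_{t=1}^{T}\{\,y \notin C_{t}(x)\,\}\Big)
  \;\le\; \sum_{t=1}^{T} \Pr\big(y \notin C_{t}(x)\big)
  \;\le\; \sum_{t=1}^{T} \alpha_{t} \;=\; \alpha,
```

so coverage of at least 1 - alpha holds at the stopping time however the rule chooses tau; a martingale argument of the kind the authors describe would matter only for allocations sharper than this union bound.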
-
Referee: [§4] §4 (Experiments): The text states that target coverage is achieved on the reported benchmarks, but provides no calibration-set size, explicit nonconformity-score definition, or per-turn coverage breakdown; without these it is impossible to verify whether the empirical results actually support the claimed guarantee.
Authors: We agree these details are necessary for verification. The revised manuscript now reports the exact calibration-set sizes (1,000 examples per benchmark), gives the precise nonconformity-score definition used (negative log-probability of the gold answer under the model), and adds a supplementary table showing per-turn empirical coverage together with the cumulative coverage at the adaptive stopping time. These additions confirm that the reported results align with the theoretical guarantee. revision: yes
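Given the details quoted here, the empirical check is mechanical. A hedged sketch, with the score definition taken from the rebuttal and all names illustrative:

```python
import numpy as np

def nonconformity(gold_logprob):
    """Negative log-probability of the gold answer, per the rebuttal."""
    return -gold_logprob

def empirical_coverage(test_gold_logprobs, threshold):
    """Fraction of test questions whose gold answer lands inside the
    prediction set, i.e. whose nonconformity score is at most the
    calibrated threshold."""
    scores = nonconformity(np.asarray(test_gold_logprobs, dtype=float))
    return float(np.mean(scores <= threshold))
```

With the stated 1,000-example calibration sets, the promised per-turn table should show this quantity close to, and not systematically below, the per-turn nominal level at every turn.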
-
Referee: [§5] §5 (New Metric): The joint coverage-efficiency metric is introduced without a formal definition, invariance properties, or comparison to existing CP efficiency measures, making it difficult to assess whether it adds reproducible value.
Authors: We have expanded §5 with a formal definition of the joint metric as the product of the coverage indicator and a normalized efficiency term (1 − |C| / |C_max|). We prove its invariance under monotone transformations of the nonconformity scores and include a direct comparison against the conventional average set-size metric and the efficiency-coverage Pareto curves from prior single-turn CP work. The new material demonstrates that the metric provides a compact, reproducible summary tailored to multi-turn adaptive settings. revision: yes
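Taking the rebuttal's definition at face value, the per-example metric can be written down directly; the normalization by |C_max| is as quoted, and any aggregation across examples is our assumption:

```python
def joint_metric(covered, set_size, max_set_size):
    """Coverage indicator times normalized efficiency,
    1(y in C) * (1 - |C| / |C_max|), as stated in the rebuttal."""
    if max_set_size <= 0:
        raise ValueError("max_set_size must be positive")
    indicator = 1.0 if covered else 0.0
    return indicator * (1.0 - set_size / max_set_size)
```

A covered example with a small set scores near 1, while a miss scores 0 regardless of set size, which is precisely how the metric ties validity and efficiency together.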
Circularity Check
MiCP's coverage guarantee is derived from standard split conformal prediction with an explicit error allocation; by construction the argument does not collapse into a restatement of its own assumptions.
Full rationale
The paper introduces MiCP by allocating per-turn error budgets (alpha_t) such that sum alpha_t = alpha, then applies standard conformal prediction at each stopping time. No equations are presented that define the coverage probability in terms of the stopping rule itself, nor are any parameters fitted to the test data and then relabeled as predictions. The validity argument rests on the marginal coverage property of conformal prediction under the maintained exchangeability assumption, which is stated as an assumption rather than derived from the method. Empirical results on external QA benchmarks are reported separately from the guarantee. No self-citations are used to justify uniqueness or to import an ansatz. The derivation is therefore self-contained and does not collapse to a tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Conformal prediction supplies valid marginal coverage under exchangeability of the data and model outputs.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[2]
A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification
Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021
-
[3]
Conformal prediction for natural language processing: A survey
Margarida Campos, António Farinhas, Chrysoula Zerva, Mário AT Figueiredo, and André FT Martins. Conformal prediction for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 12:1497--1516, 2024
-
[4]
Principled context engineering for rag: Statistical guarantees via conformal prediction
Debashish Chakraborty, Eugene Yang, Daniel Khashabi, Dawn Lawrie, and Kevin Duh. Principled context engineering for rag: Statistical guarantees via conformal prediction. arXiv preprint arXiv:2511.17908, 2025
-
[5]
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. CoRR, abs/2402.10612, 2024. doi:10.48550/ARXIV.2402.10612. URL https://doi.org/10.48550/arXiv.2402.10612
- [6]
-
[7]
Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6609--6625, Barcelona, Spain (Online), 2020. International Committee on Computational Linguistics
-
[8]
Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics:...
-
[9]
Active retrieval augmented generation
Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7969--7992, Singapore, December 2023. Association ...
-
[10]
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601--1611, Vancouver, Canada, 2017
-
[11]
Conformal prediction with large language models for multi-choice question answering
Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023
-
[12]
Natural questions: A benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...
-
[13]
TRAQ: Trustworthy retrieval augmented question answering via conformal prediction
Shuo Li, Sangdon Park, Insup Lee, and Osbert Bastani. TRAQ: Trustworthy retrieval augmented question answering via conformal prediction. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Paper...
-
[14]
Dehai Min, Kailin Zhang, Tongtong Wu, and Lu Cheng. Quco-rag: Quantifying uncertainty from the pre-training corpus for dynamic retrieval-augmented generation. arXiv preprint arXiv:2512.19134, 2025
- [15]
-
[16]
Adaptive retrieval without self-knowledge? bringing uncertainty back home
Viktor Moskvoretskii, Maria Marina, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, and Alexander Panchenko. Adaptive retrieval without self-knowledge? bringing uncertainty back home. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1...
-
[17]
Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. Conformal language modeling. arXiv preprint arXiv:2306.10193, 2023
-
[18]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
-
[19]
Nitin Liladhar Rane, Abhijeet Tawde, Saurabh P Choudhary, and Jayesh Rane. Contribution and performance of chatgpt and other large language models (llm) for scientific and research advancements: a double-edged sword. International Research Journal of Modernization in Engineering Technology and Science, 5(10):875--899, 2023
-
[20]
Conformal language model reasoning with coherent factuality
Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, and Surbhi Goel. Conformal language model reasoning with coherent factuality. arXiv preprint arXiv:2505.17126, 2025
-
[21]
Confident adaptive language modeling
Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456--17472, 2022
-
[22]
A tutorial on conformal prediction
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008
-
[23]
Analyzing uncertainty of llm-as-a-judge: Interval evaluations with conformal prediction, 2025
Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, and Jian Kang. Analyzing uncertainty of llm-as-a-judge: Interval evaluations with conformal prediction, 2025. URL https://arxiv.org/abs/2509.18658
-
[24]
Reflexion: Language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634--8652, 2023
-
[25]
API is enough: Conformal prediction for large language models without logit-access
Jiayuan Su, Jing Luo, Hongwei Wang, and Lu Cheng. API is enough: Conformal prediction for large language models without logit-access. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 979--995, Miami, Florida, USA, November 2024a. Association for Computational Linguis...
-
[26]
Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. Dragin: Dynamic retrieval augmented generation based on the information needs of large language models, 2024b. URL https://arxiv.org/abs/2403.10081
-
[27]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024
-
[28]
MuSiQue: Multihop questions via single-hop question composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539--554, 2022
-
[29]
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...
-
[30]
Large language models for robotics: Opportunities, challenges, and perspectives
Jiaqi Wang, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Bao Ge, and Shu Zhang. Large language models for robotics: Opportunities, challenges, and perspectives. Journal of Automation and Intelligence, 4(1):52--64, 2025
-
[31]
HotpotQA: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018
-
[32]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022
-
[33]
Seakr: Self-aware knowledge retrieval for adaptive retrieval augmented generation
Zijun Yao, Weijian Qi, Liangming Pan, Shulin Cao, Linmei Hu, Weichuan Liu, Lei Hou, and Juanzi Li. Seakr: Self-aware knowledge retrieval for adaptive retrieval augmented generation. CoRR, abs/2406.19215, 2024. doi:10.48550/ARXIV.2406.19215. URL https://doi.org/10.48550/arXiv.2406.19215
-
[34]
Benchmarking LLMs via uncertainty quantification
Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. Benchmarking LLMs via uncertainty quantification. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=L0oSfTroNE
-
[35]
Conformal structured prediction
Botong Zhang, Shuo Li, and Osbert Bastani. Conformal structured prediction. arXiv preprint arXiv:2410.06296, 2024
-
[36]
Conformal prediction: A data perspective
Xiaofan Zhou, Baiting Chen, Yu Gui, and Lu Cheng. Conformal prediction: A data perspective. ACM Computing Surveys, 2025