Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Hamed Khosravi; Xiaoming Huo

arXiv: 2605.20270 · v1 · pith:IKRKQWDInew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-21 07:27 UTC · model grok-4.3

keywords: conformal prediction, anytime validity, selective risk control, RLVR, LLM deployment, e-process, risk bounds

The pith

Conformal Selective Acting wraps RLVR-trained LLMs to deliver anytime-pathwise selective risk control at every round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Conformal Selective Acting as a deployment wrapper for specialist LLMs fine-tuned with reinforcement learning from verifiable rewards. It addresses the need for per-deployment safety certificates in regulated settings without waiting for long-run averages or pooling data across deployments. The method maintains a Ville-type e-process for each threshold on a Bonferroni grid and evaluates it against the RLVR filtration. Under the assumptions of predictable updates and isotonic-calibrated monotone risk, it establishes bounds on selective risk that hold pathwise at every time step. This enables operators to certify and release decisions with controlled error at each round while maintaining high release rates.

Core claim

Conformal Selective Acting fills an empty cell in the (test statistic, validity guarantee, deployment rule) framework by using a per-round wrapper that maintains a Ville-type e-process per threshold on a Bonferroni grid evaluated against the RLVR filtration. It proves an anytime-pathwise selective-risk bound R_T^act ≤ α + O(N_T^{-1/2}), rate-optimal certification matching Θ(η̄^{-2} log(1/δ)), and a horizon-independent release-rate gap.
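For readers less familiar with the two tools named in this claim, the standard facts behind a "Ville-type e-process on a Bonferroni grid" can be written as follows. This is textbook background, not the paper's own derivation. The grid size K, the per-threshold budgets δ_k, and the notation E_t^{(q_k)} are illustrative assumptions, and the paper may allocate its error budget differently.

```latex
% Ville's inequality: for a nonnegative supermartingale (E_t) with E_0 = 1
% under the null hypothesis, the wealth rarely ever crosses 1/delta.
\[
  \mathbb{P}\Bigl(\exists\, t \ge 0 : E_t \ge \tfrac{1}{\delta}\Bigr) \le \delta .
\]

% Bonferroni (union bound) over a grid q_1, ..., q_K, restricted to the set
% \mathcal{N} of thresholds whose null is true, with budgets
% \delta_1 + ... + \delta_K \le \delta.
\[
  \mathbb{P}\Bigl(\exists\, k \in \mathcal{N},\ \exists\, t \ge 0 :
      E_t^{(q_k)} \ge \tfrac{1}{\delta_k}\Bigr)
  \;\le\; \sum_{k \in \mathcal{N}} \delta_k \;\le\; \delta .
\]
```

Because the first bound holds over all times at once, a threshold can be certified at whatever round its e-process crosses the bar, which is what makes the resulting guarantee anytime-valid rather than tied to a fixed horizon.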

What carries the argument

The central mechanism is the Conformal Selective Acting wrapper that constructs and evaluates a Ville-type e-process per threshold on a Bonferroni grid to achieve selective risk control.
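The mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ThresholdEProcessGrid`, the fixed bet size `lam`, the equal split δ/K across thresholds, binary losses, and the convention that a round is released when its surrogate score is at most the threshold are all hypothetical choices made here. The sketch also assumes the verifier's outcome is observed every round, whether or not the answer was released.

```python
import math


class ThresholdEProcessGrid:
    """One betting e-process per candidate threshold (illustrative sketch)."""

    def __init__(self, thresholds, alpha, delta, lam=0.5):
        # lam is a fixed, hypothetical bet size; it must keep every factor
        # 1 + lam * (alpha - loss) strictly positive for loss in {0, 1}.
        if not 0.0 < lam < 1.0 / (1.0 - alpha):
            raise ValueError("lam must lie in (0, 1 / (1 - alpha))")
        self.thresholds = sorted(thresholds)
        self.alpha = alpha
        self.lam = lam
        # Bonferroni: each threshold gets an equal share delta / K (assumption),
        # so it is certified once its wealth reaches K / delta.
        self.log_bar = math.log(len(self.thresholds) / delta)
        self.log_wealth = [0.0] * len(self.thresholds)
        self.certified = [False] * len(self.thresholds)

    def deployed_threshold(self):
        """Largest certified threshold, or None if nothing is certified yet."""
        hits = [q for q, ok in zip(self.thresholds, self.certified) if ok]
        return max(hits) if hits else None

    def step(self, score, loss):
        """Decide release from past data only, then absorb this round's loss."""
        q = self.deployed_threshold()
        release = q is not None and score <= q
        for i, q_i in enumerate(self.thresholds):
            if score <= q_i:  # threshold q_i would have acted on this round
                # Wealth grows when the loss is below alpha, shrinks otherwise.
                self.log_wealth[i] += math.log1p(self.lam * (self.alpha - loss))
                if self.log_wealth[i] >= self.log_bar:
                    self.certified[i] = True  # certification is permanent
        return release
```

A caller would invoke `step(score, loss)` once per round. The release decision is computed before the current loss is folded in, which mirrors the predictability requirement: the gate at round t depends only on rounds before t.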

If this is right

The selective risk R_T^act stays bounded by α plus a term that shrinks as O(N_T^{-1/2}).
Certification matches the optimal rate Θ(η̄^{-2} log(1/δ)).
The release-rate gap is independent of the time horizon.
CSA satisfies pathwise validity and non-refusing deployment on every tested benchmark, distribution-shift, and live RLVR cell.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This wrapper could apply to other online updating models that satisfy predictable updates.
Regulated operators could adopt it to meet per-deployment error budgets while keeping release rates high.
Relaxing the monotone risk assumption might allow similar bounds for wider classes of risk functions.

Load-bearing premise

The central claim relies on the model updates being predictable and the risk function being isotonic-calibrated and monotone with respect to the threshold.
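One way an operator might probe the monotone-risk premise on logged data is to compare the empirical risk-versus-threshold curve with its best monotone fit. The sketch below is an illustration only, not a diagnostic taken from the paper: the function names are hypothetical, it assumes a round is selected when its score is at most the threshold, and it uses the pool-adjacent-violators algorithm for the non-decreasing fit.

```python
def empirical_risk_curve(scores, losses, thresholds):
    """Mean loss among rounds with score <= q, for each threshold q with data."""
    risks, counts = [], []
    for q in sorted(thresholds):
        selected = [l for s, l in zip(scores, losses) if s <= q]
        if selected:
            risks.append(sum(selected) / len(selected))
            counts.append(len(selected))
    return risks, counts


def pav_nondecreasing(values, weights):
    """Weighted isotonic (non-decreasing) fit via pool-adjacent-violators."""
    blocks = []  # each block is [mean, weight, number_of_points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def max_monotonicity_residual(scores, losses, thresholds):
    """Largest gap between the empirical risk curve and its monotone fit."""
    risks, counts = empirical_risk_curve(scores, losses, thresholds)
    fit = pav_nondecreasing(risks, counts)
    return max((abs(r - f) for r, f in zip(risks, fit)), default=0.0)
```

A residual of zero means the empirical curve is already non-decreasing in the threshold. A large residual on a stream with many rounds would be a warning sign that the premise may not hold there, although sampling noise alone can produce small nonzero values.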

What would settle it

A sequence of decisions that satisfies predictable updates and monotone risk, yet whose empirical selective risk exceeds α plus a term shrinking as N_T^{-1/2}, would falsify the bound.
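A check of this kind could be run on a logged stream along the following lines. This is a hedged sketch: the constant `c` stands in for the unspecified constant hidden in the O(·) term and is purely hypothetical, N_t is taken here to be the number of released rounds so far, and losses are assumed to be binary.

```python
import math


def bound_violations(released, losses, alpha, c):
    """Rounds (1-indexed) where the running selective risk exceeds alpha + c / sqrt(N_t).

    released: iterable of bools, True if the answer was released that round.
    losses:   iterable of 0/1 errors, counted only on released rounds.
    c:        hypothetical constant standing in for the O(.) term.
    """
    n_act = 0      # number of released rounds so far
    errors = 0     # errors among released rounds so far
    violations = []
    for t, (rel, loss) in enumerate(zip(released, losses), start=1):
        if not rel:
            continue  # refused rounds do not change the selective risk
        n_act += 1
        errors += loss
        if errors / n_act > alpha + c / math.sqrt(n_act):
            violations.append(t)
    return violations
```

For example, `bound_violations([True] * 4, [0, 0, 1, 0], alpha=0.1, c=0.2)` flags rounds 3 and 4, while the same stream with `c=1.0` flags none. Since the true constant is not given in the material summarized here, an empty result for a generous `c` is weak evidence, and only violations that persist for any plausible `c` as N_t grows would bear on the claim.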

Figures

Figures reproduced from arXiv: 2605.20270 by Hamed Khosravi, Xiaoming Huo.

**Figure 1.** CSA as a plug-and-play deployment wrapper. Top (blue): the RLVR system is unchanged. Middle (green): CSA computes a surrogate score, updates one e-process per candidate threshold, certifies when the e-process crosses δ_q^{-1}, and gates release via the largest certified threshold. Bottom (red): the anytime-valid guarantee holds simultaneously for all T under predictable updates and monotone risk, without ex…

**Figure 2.** Empirical validation of the RLVR structural layer on the live LoRA experiment (K=8 SFT rounds, 100 MATH problems each). (a) Estimated oracle frontier q_k^⋆; red dashed lines mark drop rounds (D_T = 3). (b) Per-round slack ξ_k; nonzero only at drop rounds. (c) Cumulative slack B_T = Σ_k ξ_k ≈ 0.40.

**Figure 3.** Per-benchmark, per-α phase-budget view. Each panel is a benchmark; the x-axis is the six-value α grid; each point is one (method, α) cell with Risk on the y-axis and AR encoded by point size. The y = α diagonal (dotted) is the pathwise-safety boundary; the shaded region above is the violation zone. CSA stays at or below the diagonal on every (benchmark, α) cell.

**Figure 4.** Risk vs. nominal budget α across all 8 benchmarks (10 replications per cell). Dotted diagonal y = α: pathwise-safety boundary. Shaded red: violation zone. CSA is the only method whose Risk stays at or below the diagonal on every benchmark at every α.

**Figure 5.** Safe-Coverage radar, one mini-radar per method with at least one nonzero vertex. The vertex at benchmark b is Cov at α_b^⋆ if the method is pathwise-safe, else 0.

**Figure 6.** Live RLVR + online LoRA on Qwen2.5-Math-7B (MATH cell, T=4,000, 20 replications). Top: running selective risk R_t^act for all ten methods; dashed: y = α; red: violation zone. Bottom: running action rate AR_t restricted to methods whose final risk is at or below α.

Original abstract

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CSA fills a gap with a per-threshold e-process for anytime selective risk on online RLVR streams, but the pathwise bound rests on an unverified isotonic monotonicity assumption.

The main thing to know is that this paper supplies a deployment wrapper for anytime-valid selective risk control on non-exchangeable online LLM update streams. It uses a per-threshold Ville e-process on a Bonferroni grid, paired with a max-certified-threshold rule, to deliver pathwise guarantees without pooling across deployments or waiting for long-run averages. Under predictable updates and isotonic-calibrated monotone risk, the authors prove an anytime-pathwise selective-risk bound of the form R_T^act ≤ α + O(N_T^{-1/2}), rate-optimal certification, and a horizon-independent release-rate gap.

The experiments run across eight specialist benchmarks (480 streams), sixteen adversarial-shift cells (160 streams), and five live Expert-Iteration RLVR cells with online LoRA updates over four base models (10,300 rounds). CSA is the only method among the ten compared that maintains pathwise validity and non-refusing deployment on every cell. That empirical breadth is the clearest strength; the construction applies standard conformal and e-process ingredients to a new (test statistic, validity guarantee, deployment rule) combination that prior work left empty.

The soft spot is the central assumption that risk remains monotone and isotonic-calibrated when evaluated against the filtration induced by the online LoRA updates. The proofs invoke this for both the supermartingale property and the deployment rule, and the live-stream results rely on it, yet no diagnostics appear to confirm that it holds on the actual rounds. If monotonicity breaks on even a modest fraction of updates, the e-process property fails and the claimed pathwise bound does not follow. That is the main place where additional evidence would tighten the argument. The rest of the math and citation pattern looks standard for the area, with no visible circularity or free parameters.
This is for operators who run local specialist models under per-deployment error budgets and need per-round certificates rather than marginal or long-run guarantees. A reader working on online conformal methods or regulated LLM deployment would get concrete value from the comparisons. It deserves a serious referee because the formal claims are specific and the test coverage is large enough to be informative, even if the monotonicity assumption needs more direct support.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Conformal Selective Acting (CSA), a per-round wrapper for RLVR-trained LLMs that maintains a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration, and deploys via a max-certified-threshold rule. Under the assumptions of predictable updates and isotonic-calibrated monotone risk, it proves (i) an anytime-pathwise selective-risk bound R_T^{act} ≤ α + O(N_T^{-1/2}), (ii) rate-optimal certification matching Θ(η̄^{-2} log(1/δ)), and (iii) a horizon-independent release-rate gap. Across 480 benchmark streams, 160 adversarial-shift streams, and 10,300 live Expert-Iteration rounds with online LoRA, CSA is the only method among the ten compared that satisfies pathwise validity and non-refusing deployment on every cell.

Significance. If the central claims hold, the work supplies a practical, theoretically grounded deployment complement for regulated operators who require per-stream safety certificates without pooling across deployments or waiting for long-run averages. It receives credit for identifying an empty cell in the (test statistic, validity guarantee, deployment rule) framework, for applying standard e-process and conformal ingredients to the online RLVR setting, and for the scale of the empirical evaluation (480 benchmark streams, 160 adversarial-shift streams, and 10,300 live rounds) that directly tests pathwise validity.

major comments (2)

[§3.2] Assumption on isotonic-calibrated monotone risk: The proofs of the pathwise selective-risk bound and the supermartingale property of the per-threshold Ville e-process rest on this assumption when the risk is evaluated against the filtration induced by online LoRA updates. The manuscript invokes the assumption for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams, yet reports no diagnostic (e.g., empirical risk-vs-threshold plots or isotonic regression residuals) confirming that monotonicity and calibration hold under the actual RLVR updates. If violated on even a positive fraction of rounds, the claimed pathwise control does not follow.
[Theorem 1 / §4.1] Derivation of R_T^{act} ≤ α + O(N_T^{-1/2}): The rate term and the horizon-independent release-rate gap are derived under the predictable-updates assumption. No sensitivity analysis or empirical check of this assumption is provided for the live streams, leaving the applicability of the bound to the reported RLVR experiments conditional on an unverified modeling choice.

minor comments (2)

[§3] The Bonferroni grid construction and the precise definition of the max-certified-threshold rule would benefit from an explicit algorithmic pseudocode block in §3.
[Experiments section] Table captions for the 10,300-round experiments should list the four base models and three architecture families to allow direct replication of the reported coverage numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The comments highlight important points regarding the empirical support for our modeling assumptions. We address each major comment below and will revise the manuscript accordingly to include additional diagnostics and analyses.

Referee: [§3.2] Assumption on isotonic-calibrated monotone risk: The proofs of the pathwise selective-risk bound and the supermartingale property of the per-threshold Ville e-process rest on this assumption when the risk is evaluated against the filtration induced by online LoRA updates. The manuscript invokes the assumption for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams, yet reports no diagnostic (e.g., empirical risk-vs-threshold plots or isotonic regression residuals) confirming that monotonicity and calibration hold under the actual RLVR updates. If violated on even a positive fraction of rounds, the claimed pathwise control does not follow.

Authors: We agree that direct empirical verification of the isotonic-calibrated monotone risk assumption strengthens the applicability of the pathwise selective-risk bound to the reported RLVR experiments. In the revised manuscript we will add risk-versus-threshold plots together with isotonic regression residual diagnostics for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams. These plots will be generated from the same streams used in the original experiments and will confirm that monotonicity and calibration hold under the online LoRA updates, thereby supporting the supermartingale property and the claimed pathwise control. revision: yes
Referee: [Theorem 1 / §4.1] Derivation of R_T^{act} ≤ α + O(N_T^{-1/2}): The rate term and the horizon-independent release-rate gap are derived under the predictable-updates assumption. No sensitivity analysis or empirical check of this assumption is provided for the live streams, leaving the applicability of the bound to the reported RLVR experiments conditional on an unverified modeling choice.

Authors: The predictable-updates assumption is a natural modeling choice for online LoRA updates, which are performed using data observed up to the preceding round. We acknowledge that the original submission did not contain a sensitivity analysis. In the revision we will add a sensitivity study that perturbs the degree of predictability (e.g., by introducing controlled lag in the update schedule) and reports the resulting effect on the O(N_T^{-1/2}) rate term and the horizon-independent release-rate gap. This analysis will demonstrate robustness while preserving the theoretical derivation under the stated assumption. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived from e-process and conformal methods under explicit external assumptions

The paper applies standard Ville-type e-processes and conformal risk control to the RLVR online-update setting. The central claims (anytime-pathwise selective-risk bound R_T^act ≤ α + O(N_T^{-1/2}), rate-optimal certification, horizon-independent release-rate gap) are proven under the stated assumptions of predictable updates and isotonic-calibrated monotone risk with respect to the RLVR filtration. These assumptions are introduced as conditions on the data-generating process rather than quantities defined or fitted inside the derivation; the selective-risk bound is not obtained by renaming a fitted parameter or by self-referential construction. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem imported from the authors' prior work appears in the provided abstract or description. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions required for the filtration and the risk bound; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption: predictable updates
Invoked when the method is evaluated against the RLVR filtration for online streams.
domain assumption: isotonic-calibrated monotone risk
Required to obtain the selective-risk bound R_T^{act} ≤ α + O(N_T^{-1/2}).

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound R_T^act ≤ α + O(N_T^{-1/2})"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


Conformal Inference for Online Prediction with Arbitrary Distribution Shifts , author =. Journal of Machine Learning Research , volume =. 2024 , url =

work page 2024

[15] [15]

Advances in Neural Information Processing Systems , year =

Localized Adaptive Risk Control , author =. Advances in Neural Information Processing Systems , year =

work page

[16] [16]

Advances in Neural Information Processing Systems , volume =

Adaptive Conformal Inference Under Distribution Shift , author =. Advances in Neural Information Processing Systems , volume =. 2021 , url =

work page 2021

[17] [17]

Proceedings of the 40th International Conference on Machine Learning (ICML) , series =

Improved Online Conformal Prediction via Strongly Adaptive Online Learning , author =. Proceedings of the 40th International Conference on Machine Learning (ICML) , series =. 2023 , publisher =

work page 2023

[18] [18]

Proceedings of the 39th International Conference on Machine Learning (ICML) , series =

Learn Then Test: Calibrating Predictive Algorithms to Achieve Risk Control , author =. Proceedings of the 39th International Conference on Machine Learning (ICML) , series =. 2022 , publisher =

work page 2022

[19] [19]

Proceedings of the 12th International Conference on Learning Representations (ICLR) , year =

Conformal Risk Control , author =. Proceedings of the 12th International Conference on Learning Representations (ICLR) , year =

work page

[20] [20]

Proceedings of the 41st International Conference on Machine Learning (ICML) , series =

Language Models with Conformal Factuality Guarantees , author =. Proceedings of the 41st International Conference on Machine Learning (ICML) , series =. 2024 , publisher =

work page 2024

[21] [21]

The Annals of Statistics , volume =

Conformal Prediction Beyond Exchangeability , author =. The Annals of Statistics , volume =. 2023 , doi =

work page 2023

[22] [22]

International Conference on Learning Representations , year =

Conformal Language Modeling , author =. International Conference on Learning Representations , year =

work page

[23] [23]

Advances in Neural Information Processing Systems , year =

Large Language Model Validity via Enhanced Conformal Prediction Methods , author =. Advances in Neural Information Processing Systems , year =

work page

[24] [24]

2025 , url =

Wang, Zhiyuan and Wang, Qingni and Zhang, Yue and Chen, Tianlong and Zhu, Xiaofeng and Shi, Xiaoshuang and Xu, Kaidi , booktitle =. 2025 , url =

work page 2025

[25] [25]

The Annals of Mathematical Statistics , volume =

Sequential Tests of Statistical Hypotheses , author =. The Annals of Mathematical Statistics , volume =. 1945 , publisher =

work page 1945

[26] [26]

Journal of the Royal Statistical Society Series B: Statistical Methodology , volume =

Estimating Means of Bounded Random Variables by Betting , author =. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume =. 2024 , doi =

work page 2024

[27] [27]

ACM Computing Surveys , volume =

Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Yejin and Madotto, Andrea and Fung, Pascale , title =. ACM Computing Surveys , volume =. 2023 , doi =

work page 2023

[28] [28]

Busch, Felix and Hoffmann, Lena and Rueger, Christopher and van Dijk, Elon H. C. and Kader, Rawen and Ortiz-Prado, Esteban and Makowski, Marcus R. and Saba, Luca and Hadamitzky, Martin and Kather, Jakob Nikolas and Truhn, Daniel and Cuocolo, Renato and Adams, Lisa C. and Bressem, Keno K. , title =. Communications Medicine , volume =. 2025 , doi =

work page 2025

[29] [29]

Non-Exchangeable Conformal Risk Control , booktitle =

Ant. Non-Exchangeable Conformal Risk Control , booktitle =. 2024 , eprint =

work page 2024

[30] [30]

Journal of the ACM , volume =

Distribution-Free, Risk-Controlling Prediction Sets , author =. Journal of the ACM , volume =. 2021 , doi =

work page 2021

[31] [31]

Bao, Yajie and Huo, Yuyang and Ren, Haojie and Zou, Changliang , journal =

work page

[32] [32]

Proceedings of the 37th International Conference on Machine Learning (ICML) , series =

Online Control of the False Coverage Rate and False Sign Rate , author =. Proceedings of the 37th International Conference on Machine Learning (ICML) , series =. 2020 , publisher =

work page 2020

[33] [33]

Hu, Zirui and Zhang, Zheng and Wang, Yingjie and Rutkowski, Leszek and Tao, Dacheng , booktitle =

work page

[34] [34]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page

[35] [35]

Applied Sciences , volume =

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams , author =. Applied Sciences , volume =. 2021 , doi =

work page 2021

[36] [36]

and Lu, Xinghua , booktitle =

Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William W. and Lu, Xinghua , booktitle =. 2019 , doi =

work page 2019

[37] [37]

2021 , doi =

Zhu, Fengbin and Lei, Wenqiang and Huang, Youcheng and Wang, Chao and Zhang, Shuo and Lv, Jiancheng and Feng, Fuli and Chua, Tat-Seng , booktitle =. 2021 , doi =

work page 2021

[38] [38]

Proceedings of EMNLP , year =

Lessons from Natural Language Inference in the Clinical Domain , author =. Proceedings of EMNLP , year =

work page

[39] [39]

2021 , eprint =

Training Verifiers to Solve Math Word Problems , author =. 2021 , eprint =

work page 2021

[40] [40]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages =

Vilares, David and G. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages =. 2019 , publisher =

work page 2019

[41] [41]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , year =. Think You Have Solved Question Answering? Try. 1803.05457 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

and Henderson, Peter and Ho, Daniel E

Zheng, Lucia and Guha, Neel and Anderson, Brandon R. and Henderson, Peter and Ho, Daniel E. , booktitle =. When Does Pretraining Help?. 2021 , doi =

work page 2021

[43] [43]

and Androutsopoulos, Ion and Katz, Daniel Martin and Aletras, Nikolaos , booktitle =

Chalkidis, Ilias and Jana, Abhik and Hartung, Dirk and Bommarito, Michael J. and Androutsopoulos, Ion and Katz, Daniel Martin and Aletras, Nikolaos , booktitle =. 2022 , doi =

work page 2022

[44] [44]

arXiv preprint arXiv:2509.15279 (2025)

Liu, Chi and Li, Derek and Shu, Yan and Chen, Robin and Duan, Derek and Fang, Teng and Dai, Bryan , year =. 2509.15279 , archivePrefix=

work page arXiv

[45] [45]

Fin-r1: A large language model for financial reasoning through reinforcement learning.arXiv preprint arXiv:2503.16252, 2025

Liu, Zhaowei and Guo, Xin and Yang, Zhi and Lou, Fangqi and Zeng, Lingfeng and Niu, Jinyi and Li, Mengping and Qi, Qi and Liu, Zhiqiang and Han, Yiyang and others , year =. 2503.16252 , archivePrefix=

work page arXiv

[46] [46]

Colombo, Pierre and Pires, Telmo Pessoa and Boudiaf, Malik and Culver, Dominic and Melo, Rui and Corro, Caio and Martins, Andre F. T. and Esposito, Fabrizio and Raposo, Vera L. 2024 , eprint =

work page 2024

[47] [47]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Yang, An and Zhang, Beichen and Hui, Binyuan and Gao, Bofei and Yu, Bowen and Li, Chengpeng and Liu, Dayiheng and Tu, Jianhong and Zhou, Jingren and Lin, Junyang and others , year =. 2409.12122 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

and Zhang, Hao and Stoica, Ion , booktitle =

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with. 2023 , doi =

work page 2023

[49] [49]

arXiv preprint arXiv:2404.14779 , year =

Christophe, Cl. arXiv preprint arXiv:2404.14779 , year =. 2404.14779 , archivePrefix=

work page arXiv