pith. machine review for the scientific record.

arxiv: 2604.17912 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI

Recognition: unknown

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

Ege Onur Taga, Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Samet Oymak

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords reinforcement learning · chain-of-thought reasoning · multi-attempt verification · gradient bias · weighted rewards · GRPO · Verification@K

The pith

A calibrated weighting strategy removes gradient bias when training reinforcement learning agents on multi-attempt chain-of-thought problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies reinforcement learning in a setting where a model can make up to K successive attempts at a problem, with each attempt allowed to build on the previous ones after receiving hard verifier feedback. It demonstrates that simply weighting attempts by their pass or fail outcomes produces biased gradient estimates when the goal is to maximize Verification@K, the probability of eventual success within the K attempts. The authors propose Calibrated Attempt-Level GRPO, which introduces a specific weighting scheme that yields unbiased gradients while keeping variance low. Their analysis shows how the choice of per-attempt rewards shapes the training trajectory and the final Verification@K score. Experiments on both synthetic and real tasks confirm that the calibrated method outperforms standard GRPO and naive weighting.
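
To pin the objective down, here is one way to write Verification@K (our notation; the paper's formal definition may differ). Let p_i(θ) be the conditional probability that attempt i passes the verifier given that attempts 1 through i−1 failed. By the chain rule over the sequential attempts,

    \mathrm{Ver@}K(\theta) \;=\; \Pr_{\pi_\theta}\!\bigl[\exists\, i \le K : r_i = 1\bigr] \;=\; 1 \,-\, \prod_{i=1}^{K} \bigl(1 - p_i(\theta)\bigr).

Every attempt influences this objective through the product of failure probabilities, which is why the gradient contribution of attempt i cannot simply be weighted by its own pass/fail outcome.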

Core claim

By devising a weighting strategy for per-attempt rewards, Calibrated Attempt-Level GRPO produces unbiased gradient estimates for the Verification@K objective in multi-attempt chain-of-thought, allowing the model to learn effective correction behavior from successive verifier signals without the bias that arises from naive weighting.

What carries the argument

Calibrated Attempt-Level (CAL) GRPO, a reinforcement learning algorithm whose weighting strategy adjusts the contribution of each attempt's reward to produce unbiased gradients for the Verification@K objective.

If this is right

  • Training becomes stable enough to harness per-attempt rewards without introducing bias into the policy update.
  • Verification@K performance improves because the model learns to use earlier failures to guide later attempts.
  • The method applies across both synthetic tasks and real reasoning problems without requiring changes to the verifier.
  • Naive reward weighting is shown to be strictly inferior for this objective, providing a concrete baseline to surpass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same calibration idea could be tested in other sequential decision settings where partial successes occur before a final outcome.
  • If the unbiased property holds for larger K, it may support scaling test-time computation by allowing models to make more correction attempts during inference.
  • The approach might combine with other forms of process supervision beyond hard binary verifiers.
  • It raises the question of whether similar calibration is needed when optimizing other aggregate rewards in long-horizon reasoning.

Load-bearing premise

The proposed weighting strategy yields unbiased gradients in practice, without hidden dependencies on the particular verifier or on the distribution of attempt outcomes.

What would settle it

A controlled experiment on synthetic data that computes the exact expected gradient for Verification@K and shows that the gradients produced by CAL-GRPO differ systematically from it, or that Verification@K performance fails to improve over vanilla GRPO on held-out tasks.
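
The first half of this test can be sketched directly. Below is a minimal synthetic check (our construction, not the paper's experiment): a one-parameter "policy" whose attempts each pass with probability p = σ(θ), stopping at the first verified success or after K tries. The exact Ver@K gradient is available in closed form, so the trajectory-level estimator (final reward applied to all attempts) and the naive pass/fail-weighted estimator can both be compared against it.

    # Toy bias check (all names ours). Policy: each attempt passes i.i.d. with
    # p = sigmoid(theta); an episode stops at the first pass or after K attempts.
    # Exact gradient of Ver@K = 1 - (1-p)^K is K * (1-p)^(K-1) * p * (1-p).
    import math, random

    K, theta = 4, -0.5
    p = 1.0 / (1.0 + math.exp(-theta))
    exact = K * (1 - p) ** (K - 1) * p * (1 - p)

    def rollout():
        """One episode; returns per-attempt (score, passed) pairs."""
        ep = []
        for _ in range(K):
            ok = random.random() < p
            # d/dtheta log pi(attempt): (1-p) for a pass, -p for a fail
            ep.append(((1 - p) if ok else -p, ok))
            if ok:
                break
        return ep

    n, tl, nal = 200_000, 0.0, 0.0
    for _ in range(n):
        ep = rollout()
        R = 1.0 if ep[-1][1] else 0.0              # Verification@K reward
        tl += R * sum(g for g, _ in ep)            # trajectory-level weighting
        nal += sum(g for g, ok in ep if ok)        # naive pass/fail weighting
    print(f"exact {exact:+.4f}  TL {tl/n:+.4f}  NAL {nal/n:+.4f}")

With these numbers the exact gradient and the trajectory-level estimate agree (≈ +0.23), while the naive pass/fail estimate lands near +0.53: systematically off, which is the bias the calibrated weights are meant to remove.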

Figures

Figures reproduced from arXiv: 2604.17912 by Ege Onur Taga, Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Samet Oymak.

Figure 1: We consider RLOO and GRPO methods that weigh individual attempts. TL is the standard approach that uses the final Ver@K reward for all attempts. NAL uses per-attempt pass/fail as weights. The proposed CAL approach calibrates the naive weights to obtain unbiased, low-variance gradients.
Figure 2: Optimizing the Ver@2 objective on MATH with GRPO variants. We compare Trajectory-Level GRPO (TL-GRPO), Naive Attempt-Level GRPO (NAL-GRPO), and Calibrated Attempt-Level GRPO (CAL-GRPO). (a) CAL-GRPO achieves the best Ver@2 over training. (b) The improvement is primarily driven by stronger second-attempt correction, whereas NAL-GRPO emphasizes first-attempt success. (c) NAL-GRPO minimizes the number of attempts…
Figure 3: Optimizing the Ver@4 and Ver@2 objectives on the Markov Chain task with GRPO variants. We compare TL-GRPO, NAL-GRPO, and CAL-GRPO. Success is defined by reaching the absorbing terminal state with a valid shortest-length trajectory; the trap setting in (b) additionally fails any attempt that visits the trap state. CAL-GRPO consistently reaches higher accuracies in both settings and breaks the plateau earlier than…
Figure 4: Optimizing the Ver@2 objective on the 5×5 Maze with GRPO variants. We compare Trajectory-Level GRPO (TL-GRPO), Naive Attempt-Level GRPO (NAL-GRPO), and Calibrated Attempt-Level GRPO (CAL-GRPO). (a) CAL-GRPO reaches higher Ver@2 accuracy earlier in training. (b) NAL-GRPO increases attempt-1 success, while CAL-GRPO strengthens attempt-2 correction. (c) NAL-GRPO uses the fewest attempts, effectively behaving like…
Figure 5: Additional metrics for Ver@2 training on MATH. (a) Response length dynamics per attempt. (b) Per-attempt reward variability used by normalization. (c) Learned future-only attempt reweighting in CAL-GRPO.
Figure 6: Additional metrics for Ver@2 training on the 5×5 Maze task. (a) Response length dynamics per attempt. (b) Per-attempt reward variation used during the per-attempt normalization. (c) Learned future-only attempt reweighting in CAL-GRPO.
Figure 7: Optimizing the Ver@2 objective on the 9×9 Maze with GRPO variants. We compare Trajectory-Level GRPO (TL-GRPO), Naive Attempt-Level GRPO (NAL-GRPO), and Calibrated Attempt-Level GRPO (CAL-GRPO). (a) CAL-GRPO attains the best Ver@2. (b) Improvements mainly come from higher attempt-1 success with competitive attempt-2 correction. (c) NAL/CAL reduce the average number of attempts compared to TL-GRPO.
Figure 8: Additional metrics for Ver@2 training on the 9×9 Maze task. (a) Response length dynamics per attempt. (b) Per-attempt reward variation used during the per-attempt normalization. (c) Learned future-only attempt reweighting in CAL-GRPO.
Figure 9: Optimizing the Ver@4 objective on the 9×9 Maze with GRPO variants. We compare Trajectory-Level GRPO (TL-GRPO), Naive Attempt-Level GRPO (NAL-GRPO), and Calibrated Attempt-Level GRPO (CAL-GRPO). (a) CAL-GRPO attains the best Ver@4. (b) Improvements reflect stronger attempt-wise performance, with CAL-GRPO emphasizing correction on later attempts. (c) NAL/CAL reduce the average number of attempts compared to TL-GRPO.
Figure 10: Additional metrics for Ver@4 training on the 9×9 Maze task. (a) Response length dynamics per attempt. (b) Per-attempt reward variation used during the per-attempt normalization. (c) Learned future-only attempt reweighting in CAL-GRPO.
Figure 11: Optimizing the Ver@4 objective on the 7×7 Maze with GRPO variants. We compare Trajectory-Level GRPO (TL-GRPO), Naive Attempt-Level GRPO (NAL-GRPO), and Calibrated Attempt-Level GRPO (CAL-GRPO). (a) CAL-GRPO attains the best Ver@4. (b) Improvements reflect stronger attempt-wise performance, with CAL-GRPO emphasizing correction on later attempts. (c) NAL/CAL reduce the average number of attempts compared to TL-GRPO.
Figure 12: Additional metrics for Ver@4 training on the 7×7 Maze task. (a) Response length dynamics per attempt. (b) Per-attempt reward variation used during the per-attempt normalization. (c) Learned future-only attempt reweighting in CAL-GRPO.
Figure 13: Effect of rollout group size and standard deviation normalization on Ver@2 training for Maze. We compare GRPO group sizes N=10 and N=16 for TL-GRPO, NAL-GRPO, and CAL-GRPO, and also compare the standard GRPO normalization against a variant without standard deviation scaling. (a) Larger groups improve early Ver@2, with the most noticeable gain for CAL-GRPO. (b) Since CAL-GRPO estimates attempt-wise weights…
read the original abstract

State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives a hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighing the attempts by their pass/fail results in biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighing strategy to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influence the training and the eventual Verification@K performance. Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper addresses RL for multi-attempt chain-of-thought reasoning where a model generates up to K successive attempts, each conditioned on hard verifier feedback from prior failures. It claims that naively weighting attempts by pass/fail outcomes produces biased policy gradients for the Verification@K objective, while the proposed Calibrated Attempt-Level (CAL) GRPO uses a specific weighting strategy to recover unbiased gradients with low variance. Theory is developed to show how per-attempt rewards shape training dynamics and final Verification@K performance; experiments and ablations on synthetic and real data are said to corroborate the theory and demonstrate gains over vanilla GRPO and naive weighting.

Significance. If the unbiased-gradient claim is correct and the weighting generalizes beyond the tested regimes, the work supplies a principled mechanism for incorporating sequential verifier signals into RL without introducing bias from conditional sampling. This could meaningfully improve training of long-CoT reasoning models that exploit multiple attempts, especially when test-time compute is budgeted by Verification@K. The explicit theory linking per-attempt rewards to final performance is a positive feature.

major comments (1)
  1. [Theory section (derivation of unbiased gradients)] The central unbiased-gradient claim (abstract and theory section) appears to rest on a weighting formula that must cancel the bias induced by sequential conditioning. Later attempts are generated only after explicit failures on earlier ones, so the sampling distribution is the product of successive failure probabilities under the current policy. The derivation should explicitly include (or prove equivalence to) an importance-weight correction proportional to the inverse of the cumulative survival probability up to that attempt; otherwise a residual covariance term between the policy gradient and the failure-probability path remains. Please provide the full expansion of the expectation under the actual generation process and show that the proposed CAL weights cancel all bias terms.
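
To make the major comment concrete, a minimal version of the requested expansion (our notation, not the paper's) runs as follows. Write g_k = ∇_θ log π_θ(y_k | s_{k−1}) · 1{T ≥ k} and let S_{k−1} denote the survival event that attempts 1 through k−1 all failed. Since g_k vanishes outside S_{k−1}, any attempt-level estimator with weights w_k satisfies

    \mathbb{E}\Bigl[\textstyle\sum_{k=1}^{K} w_k\, r_k\, g_k\Bigr] \;=\; \sum_{k=1}^{K} \Pr(S_{k-1})\,\mathbb{E}\bigl[w_k\, r_k\, g_k \,\big|\, S_{k-1}\bigr],

so each term enters scaled by the cumulative survival probability Pr(S_{k−1}) = ∏_{j<k} q_j, with q_j the conditional failure probability at attempt j. Because the q_j themselves depend on θ, matching this sum to ∇_θ Ver@K leaves exactly the covariance terms between the policy score and the failure-probability path that the weights must cancel; whether the CAL weights do so is what the requested expansion would show.
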
minor comments (2)
  1. [Abstract] The abstract states that 'theory reveals how incorporating per-attempt rewards influence [sic] the training,' yet does not indicate whether the final Verification@K performance bound is parameter-free or still depends on fitted quantities from the same data.
  2. [Experiments] The description of the synthetic and real data, including any exclusion rules for failed trajectories or verifier error rates, should be expanded so that the unbiased-gradient claim can be reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for emphasizing the need for a fully explicit derivation of the unbiased gradients. We address the major comment below and will revise the manuscript accordingly to strengthen the theoretical presentation.

read point-by-point responses
  1. Referee: The central unbiased-gradient claim (abstract and theory section) appears to rest on a weighting formula that must cancel the bias induced by sequential conditioning. Later attempts are generated only after explicit failures on earlier ones, so the sampling distribution is the product of successive failure probabilities under the current policy. The derivation should explicitly include (or prove equivalence to) an importance-weight correction proportional to the inverse of the cumulative survival probability up to that attempt; otherwise a residual covariance term between the policy gradient and the failure-probability path remains. Please provide the full expansion of the expectation under the actual generation process and show that the proposed CAL weights cancel all bias terms.

    Authors: We appreciate the referee's careful reading and agree that the derivation benefits from greater explicitness. In the current theory section we introduce the CAL weights precisely to correct for the sequential conditioning bias, where the probability of reaching attempt k equals the product of failure probabilities on attempts 1 through k-1. The weights are constructed as the reciprocal of this cumulative survival probability, functioning as the required importance-sampling correction. Nevertheless, we acknowledge that the full expansion of the expectation (showing cancellation of all covariance terms between the policy gradient and the failure-probability path) is not written out in the most transparent step-by-step form. In the revised manuscript we will add this explicit expansion in the theory section (and supporting appendix), demonstrating algebraically that the CAL weighting eliminates every bias term and recovers an unbiased estimator of the Verification@K gradient. revision: yes
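
If the rebuttal's description is taken at face value (weights equal to the reciprocal of the cumulative survival probability), a plug-in version estimated from a GRPO rollout group could look like the sketch below. All names here are ours and hypothetical; the paper's actual CAL construction, normalization, and clipping may differ.

    from typing import List

    def calibrated_weights(group_outcomes: List[List[bool]], K: int) -> List[float]:
        """group_outcomes[n][i] is the pass/fail of attempt i in rollout n
        (rollouts stop early after a pass). Returns w_k = 1 / P_hat(reach k),
        where P_hat(reach k) is the empirical product of conditional failure
        rates on attempts 1..k-1, clipped for numerical stability."""
        weights, survival = [], 1.0
        for k in range(K):
            weights.append(1.0 / max(survival, 1e-3))
            reached = [o for o in group_outcomes if len(o) > k]
            if not reached:                      # no rollout got this far
                break
            fail_rate = sum(1 for o in reached if not o[k]) / len(reached)
            survival *= fail_rate                # P_hat(attempts 1..k all fail)
        return weights

For example, calibrated_weights([[False, True], [True], [False, False]], K=2) returns [1.0, 1.5]: second attempts are upweighted because only two of the three rollouts survived past the first attempt.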

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces CAL-GRPO as a novel weighting strategy claimed to yield unbiased gradients for Verification@K optimization in sequential multi-attempt CoT, supported by a separate theory section on per-attempt reward effects. No load-bearing step in the abstract or described claims reduces by construction to a fitted parameter renamed as prediction, a self-definitional loop, or a self-citation chain that substitutes for independent derivation. The central result is presented as an independent construction with external experimental validation on synthetic and real data, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are named. Standard RL assumptions (e.g., policy gradient validity) and the existence of a reliable hard verifier are implicit but not detailed.

pith-pipeline@v0.9.0 · 5497 in / 1113 out tokens · 28993 ms · 2026-05-10T05:29:04.614213+00:00 · methodology

discussion (0)

