arxiv: 2506.13351 · v3 · submitted 2025-06-16 · 💻 cs.CL · cs.AI · cs.LG

Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

Pith reviewed 2026-05-19 09:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords reinforcement learning · large language models · reasoning optimization · unverifiable tasks · token-level rewards · rubric gating · chain-of-thought · variance signal

The pith

A constrained RL framework improves LLM training on unverifiable tasks by rewarding high-variance reasoning tokens and enforcing rubric gates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for large language models is difficult on tasks such as scientific writing, medicine, legal contracts, and finance because correct answers cannot be checked automatically. The paper proposes to address this by computing a token-level Reasoning Reflection Reward that measures how certain the model is about a reference answer given its chain-of-thought prefix. It then gives extra weight to tokens whose certainty varies strongly across different rollouts, treating these as the most reflective of actual reasoning quality. A separate variance-based filter removes queries that lack enough signal for meaningful comparison, while rubric gates apply hard accept-or-reject rules to final outputs. If successful, this combination produces faster and more sample-efficient training than standard approaches while still satisfying task constraints.
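The reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's formula: the function name `r3_rewards`, the use of log-probabilities as the certainty measure, and the choice to normalize variances into weights are all assumptions made here for clarity.

```python
from statistics import pvariance


def r3_rewards(ref_logprobs):
    """Variance-weighted reward per rollout (illustrative assumption).

    ref_logprobs[i][t] is the log-probability the model assigns to reference
    token t when conditioned on the chain-of-thought of rollout i.
    """
    n_tokens = len(ref_logprobs[0])
    # Cross-rollout variance of certainty, one value per reference token.
    variances = [
        pvariance([row[t] for row in ref_logprobs]) for t in range(n_tokens)
    ]
    total = sum(variances)
    if total == 0:
        # No token separates the rollouts; fall back to uniform weights.
        weights = [1.0 / n_tokens] * n_tokens
    else:
        weights = [v / total for v in variances]
    # Each rollout's reward is its weighted certainty over the reference.
    rewards = [sum(w * lp for w, lp in zip(weights, row)) for row in ref_logprobs]
    return rewards, weights


# Two rollouts, three reference tokens; only the middle token differs.
rewards, weights = r3_rewards([[-0.1, -2.0, -0.1], [-0.1, -0.2, -0.1]])
```

In this toy input only the middle token's certainty changes between rollouts, so it receives all of the weight and the two rollouts are ranked by that token alone.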

Core claim

Direct Reasoning Optimization optimizes a dense token-level Reasoning Reflection Reward aligned with reasoning quality. The reward measures the model's token-level certainty of a reference answer under its chain-of-thought prefix and selectively emphasizes tokens with high cross-rollout variance, which the paper calls reasoning-reflective tokens. The same variance signal filters out queries with insufficient comparative signal. Rubric-gating complements the reward by operationalizing task criteria as hard accept/reject checks on final answers. Across four datasets spanning scientific writing, medicine, legal contracts, and finance, the resulting framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
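The query filter mentioned in the claim can be illustrated as follows. This is a hedged sketch: summarizing a query by its peak per-token variance, the names `has_enough_signal` and `filter_queries`, and the default threshold `0.05` are hypothetical choices, since this page does not state how the paper aggregates the variance or sets the cutoff.

```python
from statistics import pvariance


def has_enough_signal(ref_logprobs, min_variance=0.05):
    """Return True if at least one reference token separates the rollouts.

    ref_logprobs[i][t] is the log-probability of reference token t under
    rollout i's chain-of-thought. The peak-variance summary is an assumption.
    """
    n_tokens = len(ref_logprobs[0])
    peak = max(
        pvariance([row[t] for row in ref_logprobs]) for t in range(n_tokens)
    )
    return peak >= min_variance


def filter_queries(batch, min_variance=0.05):
    """Keep only query ids whose rollouts disagree enough to compare."""
    return [
        qid for qid, logprobs in batch.items()
        if has_enough_signal(logprobs, min_variance)
    ]


batch = {
    "q1": [[-0.1, -2.0], [-0.1, -0.2]],    # rollouts disagree on token 1
    "q2": [[-0.3, -0.3], [-0.3, -0.31]],   # rollouts are nearly identical
}
kept = filter_queries(batch)
```

A query whose rollouts all yield nearly the same certainty offers little to compare within the group, which is the intuition behind discarding it.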

What carries the argument

Reasoning Reflection Reward (R3), which measures token-level certainty of a reference answer under a chain-of-thought prefix and selectively emphasizes tokens showing high variance across rollouts.

If this is right

  • Outperforms strong baselines on scientific writing, medicine, legal contracts, and finance datasets.
  • Achieves faster convergence during training.
  • Improves sample efficiency by discarding queries that lack sufficient variance signal.
  • Respects task feasibility constraints through hard rubric-gating checks.
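The hard rubric-gating in the last bullet can be pictured as a set of accept/reject predicates applied to each final answer. The sketch below is an assumption-laden illustration: the helper names, the two example checks, and the idea of replacing a rejected rollout's reward with a fixed `penalty` are hypothetical, because this page does not describe how the paper treats rejected rollouts within a group.

```python
def passes_rubric(answer, checks):
    """An answer is accepted only if every rubric check accepts it."""
    return all(check(answer) for check in checks)


def gate_rewards(answers, rewards, checks, penalty=-1.0):
    """Keep the reward of accepted answers; assign `penalty` to rejected ones.

    The fixed penalty is an illustrative assumption, not the paper's rule.
    """
    return [
        reward if passes_rubric(answer, checks) else penalty
        for answer, reward in zip(answers, rewards)
    ]


# Hypothetical rubric: a length cap and a ban on placeholder text.
checks = [
    lambda a: len(a.split()) <= 50,
    lambda a: "TODO" not in a,
]

answers = ["Revenue grew 12% year over year.", "TODO fill in"]
gated = gate_rewards(answers, [0.4, 0.9], checks)
```

The point of the design is that a rollout cannot buy its way past a failed check with a high dense reward: the gate is applied independently of the reward value.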

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variance signal used to select tokens could be tested on additional domains with subjective evaluation criteria.
  • The method suggests a way to reduce reliance on large volumes of reference data by focusing training on uncertain reasoning steps.
  • Combining rubric gates with the token reward may help maintain safety properties when scaling to larger models.

Load-bearing premise

High cross-rollout variance in token certainty reliably marks reasoning-reflective tokens whose selective emphasis during training improves quality without introducing bias or discarding too many useful queries.

What would settle it

An ablation study on the same four datasets in which removing the variance-based token emphasis or the query filter produces performance no better than standard RL baselines would falsify the central claim.
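The ablation arms can be made concrete by swapping the token-weighting rule while holding everything else fixed. The following sketch is illustrative only: the function `token_weights`, the scheme names, and the top-`k` low-variance selection are assumptions introduced here, not the paper's experimental code.

```python
from statistics import pvariance


def token_weights(ref_logprobs, scheme="variance", k=1):
    """Per-token weights under three hypothetical ablation arms.

    ref_logprobs[i][t] is the log-probability of reference token t under
    rollout i's chain-of-thought.
    """
    n = len(ref_logprobs[0])
    var = [pvariance([row[t] for row in ref_logprobs]) for t in range(n)]
    if scheme == "uniform":
        # Ablation arm: every reference token counts equally.
        return [1.0 / n] * n
    if scheme == "variance":
        # Full method (as sketched here): weight by cross-rollout variance.
        total = sum(var)
        return [v / total for v in var] if total > 0 else [1.0 / n] * n
    if scheme == "low_variance":
        # Control arm: emphasize the k tokens the rollouts agree on most.
        lowest = sorted(range(n), key=lambda t: var[t])[:k]
        return [1.0 / k if t in lowest else 0.0 for t in range(n)]
    raise ValueError(f"unknown scheme: {scheme}")


data = [[-0.1, -2.0, -0.5], [-0.1, -0.2, -0.7]]
arms = {s: token_weights(data, s) for s in ("uniform", "variance", "low_variance")}
```

If training with the `uniform` or `low_variance` arm matched the `variance` arm, the variance-based emphasis would not be doing the work the paper attributes to it.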

Figures

Figures reproduced from arXiv: 2506.13351 by Emre Kıcıman, Kate Drakos Demopulos, Leonardo Nunes, Ranveer Chandra, Songwu Lu, Srinagesh Sharma, Swati Sharma, Tusher Chakraborty, Yifei Xu.

Figure 1: Illustrative example of Reasoning Reflection Reward.
Figure 2: Overview of Direct Reasoning Optimization.
Figure 3: ParaRev training insights. The accompanying text notes the known length bias in LLM-based evaluators (Zheng et al., 2023) and reports that, compared to the ROUGE-rewarded baseline, R3 yields a win rate improvement of 16.0% (GPT judge) and 20.7% (Claude judge).
Figure 4: FinQA training insights.
Original abstract

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Direct Reasoning Optimization (DRO), a constrained RL framework for training LLMs on unverifiable tasks. It defines a token-level dense Reasoning Reflection Reward (R3) that measures reference-answer certainty under CoT prefixes and upweights high cross-rollout variance tokens (termed reasoning-reflective), applies a variance-driven filter to discard low-signal queries, and adds rubric-gating as hard feasibility constraints on final answers. Experiments across four datasets (scientific writing, medicine, legal contracts, finance) claim outperformance over strong baselines, faster and more sample-efficient learning, and constraint adherence.

Significance. If the cross-rollout variance signal reliably isolates reasoning quality rather than noise or prompt artifacts, the method could advance reward design for RL on tasks without external verifiers by combining dense internal signals with hard constraints. This would be particularly relevant for domains like medicine and law where reference answers exist but step-wise verification is difficult. The approach's sample efficiency and explicit feasibility handling are potential strengths if the core mechanism holds.

major comments (3)
  1. §3.2 (R3 definition and variance computation): The selection of reasoning-reflective tokens relies on cross-rollout variance of token certainty under the model's own CoT prefix; this internal signal risks capturing sampling noise, temperature effects, or prompt fragility rather than reasoning quality, and no ablation isolating variance from these confounders (or correlating it with human reasoning annotations) is reported to support the central claim that upweighting these tokens drives the gains.
  2. §4.1 and §4.2 (experimental setup and results): The abstract and results claim outperformance across four datasets with faster learning, but provide no details on exact baselines, metrics, statistical significance tests, run variance, or controls for the query filter's effect; without these, it is unclear whether gains stem from the variance signal, rubric-gating, or post-hoc filtering of hard examples.
  3. §3.3 (query filter): The variance threshold for discarding low-signal queries is listed as a free parameter; this compounds the risk that reported improvements reflect selective removal of difficult but valid queries rather than improved learning from the R3 emphasis, and no sensitivity analysis or correlation with task difficulty is provided.
minor comments (2)
  1. §3: Notation for R3 and variance terms is introduced without a clear summary table or equation reference list, making it hard to track definitions across sections.
  2. §4: Figure captions for rollout variance visualizations could more explicitly state the number of rollouts per query and temperature settings used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, providing clarifications on our design choices and committing to revisions that add the requested analyses and details without altering the core claims of the manuscript.

Point-by-point responses
  1. Referee: §3.2 (R3 definition and variance computation): The selection of reasoning-reflective tokens relies on cross-rollout variance of token certainty under the model's own CoT prefix; this internal signal risks capturing sampling noise, temperature effects, or prompt fragility rather than reasoning quality, and no ablation isolating variance from these confounders (or correlating it with human reasoning annotations) is reported to support the central claim that upweighting these tokens drives the gains.

    Authors: We acknowledge that an internal variance signal computed from the model's own rollouts could be influenced by sampling noise or temperature. Our motivation for upweighting high-variance tokens is that they represent decision points where different reasoning paths diverge in certainty about the reference, which we argue captures reflective reasoning rather than uniform low-variance tokens. To address the concern directly, the revised manuscript will include an ablation that replaces the variance-based weighting with uniform token weighting and with low-variance selection, reporting the resulting performance differences on the same datasets. We will also add results across two temperatures to test robustness. A direct correlation with human reasoning annotations was not conducted in the original work due to annotation cost; we will explicitly note this as a limitation and propose it as future work. revision: partial

  2. Referee: §4.1 and §4.2 (experimental setup and results): The abstract and results claim outperformance across four datasets with faster learning, but provide no details on exact baselines, metrics, statistical significance tests, run variance, or controls for the query filter's effect; without these, it is unclear whether gains stem from the variance signal, rubric-gating, or post-hoc filtering of hard examples.

    Authors: We agree that the original submission lacked sufficient experimental detail. The revised manuscript will expand Sections 4.1 and 4.2 to list the precise baseline implementations and hyperparameters, specify all evaluation metrics, include statistical significance testing (e.g., paired t-tests or Wilcoxon tests), and report mean performance with standard deviation across five independent runs with different seeds. We will also add a dedicated ablation that trains without the query filter to isolate its contribution from the R3 reward and rubric-gating components. revision: yes

  3. Referee: §3.3 (query filter): The variance threshold for discarding low-signal queries is listed as a free parameter; this compounds the risk that reported improvements reflect selective removal of difficult but valid queries rather than improved learning from the R3 emphasis, and no sensitivity analysis or correlation with task difficulty is provided.

    Authors: The variance threshold is a hyperparameter chosen to retain queries that provide meaningful cross-rollout signal for the token-level reward. In the revision we will include a sensitivity analysis (in the appendix) that varies the threshold and reports downstream performance and the fraction of queries retained. We will further examine whether discarded queries correlate with simple proxies for difficulty such as reference length and domain-specific lexical complexity, and discuss the results. revision: yes
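The paired significance testing promised in response 2 could look like the sketch below when both methods are run on the same five seeds. Everything here is illustrative: the function name is hypothetical, and the score lists are made-up placeholders, not results from the paper. The constant 2.776 is the standard two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic over per-seed score differences."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Placeholder scores for five seeds (NOT values reported in the paper).
method = [0.62, 0.60, 0.65, 0.61, 0.63]
baseline = [0.58, 0.59, 0.60, 0.57, 0.60]

T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, 4 degrees of freedom
t_stat = paired_t_statistic(method, baseline)
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing by seed removes seed-to-seed variation that affects both methods equally, which matters when only five runs are available.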

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper proposes a new token-level reward R3 constructed from the model's internal certainty under CoT prefixes and cross-rollout variance, then applies it within an RL framework alongside rubric-gating. This is a definitional design choice for the method rather than a derivation that reduces the claimed outperformance to the inputs by construction. Empirical results are reported on four external datasets (scientific writing, medicine, legal, finance) against baselines, with no equations, self-citations, or uniqueness theorems shown to make the central claims tautological or statistically forced. The variance-based token selection and query filter are methodological components whose effectiveness is tested externally rather than assumed by definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on two main domain assumptions about variance signaling useful reasoning and the feasibility of rubric constraints; no new physical entities are introduced but several tunable thresholds for variance and gating are implied.

free parameters (2)
  • variance threshold for reasoning-reflective tokens
    Determines which tokens receive emphasis and which queries are filtered; value must be chosen or tuned.
  • rubric acceptance criteria
    Hard accept/reject rules operationalized per task; specific thresholds or definitions are required.
axioms (2)
  • domain assumption: Token-level certainty of a reference answer under the model's CoT prefix correlates with reasoning quality.
    Directly used to construct the R3 reward signal.
  • domain assumption: Rubric-gating can be defined as hard constraints that improve feasibility without excessively limiting useful exploration.
    Required for the constraint enforcement component to function as intended.
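Because the variance threshold in this ledger is a free parameter, one inexpensive diagnostic is to tabulate how many queries survive at each candidate value before any training is run. The sketch below assumes each query has already been reduced to a single variance score; the function name and the example scores are hypothetical, and the sweep reports only retention, not downstream performance.

```python
def retention_curve(query_variances, thresholds):
    """Fraction of queries kept at each candidate variance threshold."""
    n = len(query_variances)
    if n == 0:
        raise ValueError("need at least one query")
    return {
        th: sum(v >= th for v in query_variances) / n
        for th in thresholds
    }


# Hypothetical per-query variance scores (illustrative only).
scores = [0.0, 0.02, 0.1, 0.5]
curve = retention_curve(scores, thresholds=[0.0, 0.05, 0.2])
```

A steep drop in retention over a narrow threshold range would suggest that reported gains are sensitive to this choice and that the discarded queries deserve a closer look.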


