pith. machine review for the scientific record.

arxiv: 2605.06638 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL
keywords reinforcement learning · large language models · long-horizon reasoning · logical expressiveness · scaling laws · synthetic benchmarks · transfer learning

The pith

Reinforcement learning overcomes LLM long-horizon reasoning limits when training uses more expressive logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built ScaleLogic, a controlled synthetic environment in which the required reasoning depth and the expressiveness of the logical rules can be dialed up independently. Training LLMs with RL on these tasks produces a clean power-law relationship between training compute and reasoning depth, with the scaling exponent worsening as the logic gains conjunction, disjunction, negation, and quantifiers. Models trained under the more expressive regimes transfer more effectively to downstream math and reasoning benchmarks, gaining up to ten points and using compute more efficiently. The results indicate that current shortcomings in chaining many reasoning steps are not fixed properties of the transformer architecture but respond to choices in training data and procedure.

Core claim

RL training compute T follows T ∝ D^γ (R² > 0.99) with respect to reasoning depth D; the exponent γ rises monotonically from 1.04 in simple implication logic to 2.60 in first-order logic that includes and, or, not, and universal quantification. More expressive training produces larger downstream gains (up to +10.66 points) and more compute-efficient transfer on mathematics and general reasoning benchmarks. The same power-law relation appears across multiple RL algorithms, and curriculum ordering further improves scaling efficiency.
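As a concrete reading of the fit, the sketch below estimates γ by ordinary least squares on log-transformed data, the standard way to fit T ∝ D^γ. The (D, T) values are invented for illustration and are not the paper's measurements; the authors' exact fitting protocol may differ.

```python
# Minimal sketch: estimating gamma in T ∝ D^gamma by least squares in log-log space.
# The (D, T) pairs are invented for illustration; they are not the paper's data.
import numpy as np

D = np.array([2, 4, 6, 8, 10, 12], dtype=float)             # reasoning depths (hypothetical)
T = np.array([55, 210, 520, 980, 1600, 2400], dtype=float)  # steps to 90% accuracy (hypothetical)

logD, logT = np.log(D), np.log(T)
gamma, log_a = np.polyfit(logD, logT, 1)      # slope = gamma, intercept = log a

residuals = logT - (gamma * logD + log_a)
r2 = 1.0 - residuals.var() / logT.var()       # R^2 of the log-log fit

print(f"gamma = {gamma:.2f}, a = {np.exp(log_a):.1f}, R^2 = {r2:.4f}")
```

On data of this shape, the slope of the log-log regression is the exponent the paper reports per expressiveness setting.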

What carries the argument

ScaleLogic, a synthetic logical-reasoning framework that independently controls proof-planning depth (horizon) and the expressiveness of the underlying logic, ranging from implication-only to full first-order logic with conjunction, disjunction, negation, and universal quantification.
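For intuition about the two difficulty knobs, here is a minimal, hypothetical generator for the implication-only end of the spectrum: it chains D implications from a starting fact and offers B candidate conclusions, exactly one of which is provable. The names and encoding are illustrative assumptions; the paper's actual generator builds full candidate proof trees, corrupts axioms to make distractors unprovable, and supports the richer logics.

```python
# Hypothetical sketch of an implication-only, ScaleLogic-style problem instance.
# Not the authors' generator: names, encoding, and distractor scheme are illustrative.
import random

def make_problem(depth: int, n_candidates: int, seed: int = 0):
    rng = random.Random(seed)
    atoms = [f"P{i}" for i in range(depth + n_candidates)]
    rng.shuffle(atoms)

    fact = atoms[0]                                           # the single starting axiom
    rules = [(atoms[i], atoms[i + 1]) for i in range(depth)]  # chain: atom_i -> atom_{i+1}
    provable = atoms[depth]                                   # reachable only via all D steps

    # Distractor conclusions: atoms that appear in no rule, hence unreachable.
    distractors = atoms[depth + 1 : depth + n_candidates]
    candidates = [provable] + distractors
    rng.shuffle(candidates)
    return fact, rules, candidates, provable

fact, rules, candidates, answer = make_problem(depth=4, n_candidates=3)
print("fact:", fact)
print("rules:", ", ".join(f"{a} -> {b}" for a, b in rules))
print("candidates:", candidates, "| provable:", answer)
```

Raising `depth` lengthens the horizon while the rule format stays fixed; the more expressive settings would instead populate the rules with conjunction, disjunction, negation, and universal quantification.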

If this is right

  • Training on logics with higher expressiveness produces both larger absolute gains and better compute efficiency on downstream tasks.
  • The scaling exponent increases with logical expressiveness, so deeper reasoning becomes disproportionately more expensive in richer logics (see the worked example after this list).
  • Curriculum ordering during RL substantially reduces the compute needed to reach a given depth.
  • The power-law relationship between compute and depth is reproducible across different RL algorithms.
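To make the second bullet concrete, the power law fixes the marginal price of depth: doubling D multiplies the required compute by 2^γ, independent of the starting depth. Using the paper's reported endpoint exponents:

$$\frac{T(2D)}{T(D)} = 2^{\gamma}, \qquad 2^{1.04} \approx 2.1 \ \text{(Implication-only)}, \qquad 2^{2.60} \approx 6.1 \ \text{(+ Quantification)}.$$

A depth doubling that roughly doubles training compute in the simplest logic costs about six times as much in the most expressive setting.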

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Synthetic data generators for LLM reasoning training should emphasize logical expressiveness rather than simply increasing depth.
  • If real-world tasks can be decomposed into similar depth-expressiveness coordinates, targeted RL curricula could be designed without requiring larger models.
  • The observed scaling may generalize to other structured domains such as planning or program synthesis where horizon and rule complexity can be varied independently.

Load-bearing premise

That gains measured on the synthetic ScaleLogic tasks and their transfer to standard benchmarks serve as a faithful proxy for the long-horizon reasoning problems that appear in real applications.

What would settle it

Finding that models trained to high performance on ScaleLogic still show no meaningful improvement on a broad suite of authentic, long-chain reasoning problems outside the synthetic setting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.06638 by Abulhair Saparov, Guangchen Lan, Guanwen Qiu, Sipeng Zhang, Tianle Wang, Xinpeng Wei, Zhaoyang Wang.

Figure 1: Overview of ScaleLogic. Each problem has B candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The depth D controls proof depth. Left: implication-only reasoning. Right: the most expressive logic setting (referred to as + Quantification in Section 3.2) combines conjunction, disjunction, negation, and universal quantification.

Figure 2: Training cost scales as a power law with reasoning depth, with exponent γ governed by logical expressiveness. (a) Scatter points show the training steps T required to reach the 90% accuracy threshold as a function of reasoning depth D; solid lines show power-law fits T ∝ D^γ. (b) The fitted exponent γ increases monotonically as expressiveness increases from Implication-only (γ = 1.04) to + Quantification (γ …

Figure 3: Downstream transfer from synthetic reasoning training. (a) All settings outperform the base model, with richer logical settings producing larger and more sustained gains. (b) Average downstream performance across logical settings under fixed depth (D = 12) and fixed compute (∼100 steps). Both controls exhibit the same monotone trend: more expressive training settings yield stronger downstream performance.

Figure 4: Effect of training distribution and RL algorithm on scaling efficiency in the + Conjunction setting. Each curve aggregates three independent seeds; shading denotes ±1 standard deviation in log-space across seeds. (a) Curriculum training yields the lowest exponent (γ = 1.33), whereas difficult-only training yields the highest (γ = 2.36). (b) All RL algorithms follow power-law scaling (R² > 0.99), with expo…

Figure 5: Out-of-distribution generalization across reasoning depths. (a) Increasing the training depth consistently extends the range of solvable test depths. (b) OOD generalization remains bounded: even the models trained at the largest depths fall to random at D_test/D_train ≈ 3.

Figure 6: Power-law (orange) vs. exponential (blue) fits for each expressiveness setting.

Figure 7: Power-law scaling under the 85% accuracy threshold (cf. main-text …

Figure 8: Power-law fits T(D) = a · D^γ under four alternative compute measures, complementing …

Figure 9: Effect of candidate count B at fixed depth D = 8 under the + Quantification setting. (a) Training steps to 90% accuracy follow a power law in B (γ_B = 1.41, R² = 0.984, ΔAIC = +7.0 vs. exponential). (b) Average downstream performance across the eight reasoning benchmarks of Section 4.3, plotted against candidate count B. Gains saturate quickly: performance rises from 52.5% at B = 2 to 55.7% at B = 4 (+3.2 …

Figure 10: Uniform vs. curriculum training under the most expressive …

Figure 11: Training trajectories on + Conjunction under three data distributions. Each row: response length (left), entropy (middle), validation accuracy (right); the legend (training depth D) is shared across the three sub-panels. Axis ranges may differ across rows to improve readability.

Figure 12: Cross-scale replication on Qwen3-8B. (a) Training steps to convergence vs. reasoning depth across five expressiveness levels on log-log axes; solid lines show power-law fits T ∝ D^γ. (b) The fitted γ increases monotonically as expressiveness increases from Implication-only (γ = 0.99) to + Quantification (γ = 2.53), mirroring the 4B picture.

Figure 13: Training trajectories for all five settings. Each row: response length (left), entropy (middle), validation …

Figure 14: Number of correct answers out of 30 for frontier LLMs on …

Figure 15: MATH-500 #80: full reasoning trace of our model (top, …
Original abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that these shortcomings are fundamental to the autoregressive transformer architecture. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
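The abstract's power-law claim is only as strong as its comparison against the natural rival, exponential growth in depth, and the appendix figures report exactly that contrast (power-law vs. exponential fits with ΔAIC values). A minimal version of such a model comparison might look like the sketch below; the data are invented, and scoring both fits with a Gaussian AIC on log residuals is an assumption about the procedure, not a reproduction of the paper's.

```python
# Illustrative model comparison: power law T = a * D**g vs. exponential T = a * exp(b * D).
# Invented data; the AIC convention (Gaussian likelihood on log residuals) is an assumption.
import numpy as np

D = np.array([2, 4, 6, 8, 10, 12], dtype=float)
T = np.array([55, 210, 520, 980, 1600, 2400], dtype=float)
logT = np.log(T)

def aic_linear(x, y, k=3):                  # parameters: slope, intercept, noise variance
    slope, intercept = np.polyfit(x, y, 1)
    rss = np.sum((y - (slope * x + intercept)) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * k      # Gaussian log-likelihood up to constants

aic_power = aic_linear(np.log(D), logT)     # power law: log T is linear in log D
aic_exp = aic_linear(D, logT)               # exponential: log T is linear in D
print(f"delta AIC (exp - power) = {aic_exp - aic_power:+.1f}; positive favors the power law")
```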

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ScaleLogic, a synthetic logical reasoning environment that independently varies proof depth D (horizon) and logical expressiveness (from implication-only to full first-order logic with and/or/not/forall). It reports that RL training compute T obeys a power law T ∝ D^γ with R² > 0.99, where the exponent γ rises monotonically from 1.04 to 2.60 as expressiveness increases; more expressive curricula produce larger downstream gains (up to +10.66 points) on math and general reasoning benchmarks and more efficient transfer. The authors conclude that long-horizon reasoning deficits in LLMs are not fundamental to the autoregressive architecture and can be mitigated by improved training methodology and data.

Significance. If the scaling and transfer results are robust, the work supplies a controlled testbed for dissecting how horizon and logical complexity interact with RL compute, showing that what the model is trained on (expressiveness) matters as much as how much it is trained. The use of multiple RL algorithms and curriculum variants, together with the clean power-law fits, offers a reproducible template for studying reasoning scaling that is currently rare in the LLM literature.

major comments (3)
  1. [Abstract] The central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66-point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.
  2. [Abstract] The reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, the number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.
  3. [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.
minor comments (1)
  1. [Abstract] The abstract states that the power-law relationship 'holds across multiple RL methods' but does not name the methods or report per-method γ values; adding this information would improve clarity without altering the central argument.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and valuable suggestions. We have made revisions to the abstract to address the concerns about missing details. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66-point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.

    Authors: We agree with this observation. The abstract has been revised to include information on the specific benchmarks used for downstream evaluation and to note that the gains are based on multiple independent runs. Details on statistical significance and any corrections are provided in the main text. revision: yes

  2. Referee: [Abstract] The reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, the number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.

    Authors: We agree that the abstract should specify the experimental controls for the power-law fits. We have revised the abstract to include the range of D values used and the number of data points per fit. No data points were excluded in the fits, as described in the methods section of the manuscript. revision: yes

  3. Referee: [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.

    Authors: We appreciate this point regarding the scope of ScaleLogic. The framework is intentionally synthetic to allow independent control over horizon and expressiveness, enabling the clean scaling analysis. We do not claim it is a complete proxy for all real-world long-horizon problems. However, the transfer results to standard benchmarks demonstrate practical benefits. We have added a paragraph in the discussion section acknowledging the limitations of the synthetic setting and suggesting future work on more open-ended tasks. We maintain that the results indicate the autoregressive architecture is not fundamentally limited for long-horizon reasoning when trained appropriately. revision: partial

Circularity Check

0 steps flagged

No circularity: central claims rest on empirical scaling and transfer measurements

Full rationale

The paper's derivation consists of introducing the ScaleLogic framework for controlled experiments, fitting observed power-law relationships between training compute T and proof depth D (with reported R² > 0.99), measuring monotonic increases in the exponent γ with logical expressiveness, and reporting downstream benchmark gains from more expressive curricula. These are direct empirical observations across RL methods rather than any equation or parameter that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the conclusion that shortcomings are not fundamental follows interpretively from the transfer results without definitional collapse. The analysis is self-contained against external benchmarks and does not rely on fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the assumption that ScaleLogic tasks are representative of real reasoning difficulty and that the observed power-law relationship generalizes beyond the tested RL methods.

free parameters (1)
  • scaling exponent γ = 1.04 to 2.60
    Fitted separately for each logical-expressiveness level to the observed compute-versus-depth data.
axioms (1)
  • domain assumption: Synthetic logical tasks with controllable depth and expressiveness capture the essential bottlenecks of long-horizon reasoning in natural language and mathematics.
    Invoked when generalizing training results to downstream benchmarks.
invented entities (1)
  • ScaleLogic framework (no independent evidence)
    purpose: Provide independent control over proof horizon and logical expressiveness for RL training.
    New synthetic environment introduced to enable the scaling experiments.

pith-pipeline@v0.9.0 · 5636 in / 1343 out tokens · 32909 ms · 2026-05-12T03:05:35.344614+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 31 internal anchors

  1. [1]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

  2. [2]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor , author=. arXiv preprint arXiv:2212.09689 , year=

  3. [3]

    Contextual Integrity in

    Guangchen Lan and Huseyin A Inan and Sahar Abdelnabi and Janardhan Kulkarni and Lukas Wutschitz and Reza Shokri and Christopher Brinton and Robert Sim , booktitle=. Contextual Integrity in

  4. [4]

    Orca: Progressive learning from complex explanation traces of GPT-4

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4 , author=. arXiv preprint arXiv:2306.02707 , year=

  5. [5]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author=. arXiv preprint arXiv:2309.12284 , year=

  6. [6]

    Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author=. arXiv preprint arXiv:2308.09583 , year=

  7. [7]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

  8. [8]

    Agentin- struct: Toward generative teaching with agentic flows,

    AgentInstruct: Toward Generative Teaching with Agentic Flows , author=. arXiv preprint arXiv:2407.03502 , year=

  9. [14]

    arXiv preprint arXiv:2305.13691 , year=

    Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering , author=. arXiv preprint arXiv:2305.13691 , year=

  10. [16]

    International Conference on Learning Representations (ICLR) , year=

    Learning from Synthetic Data Improves Multi-hop Reasoning , author=. International Conference on Learning Representations (ICLR) , year=

  11. [22]

    Large Language Models are Zero-Shot Reasoners

    Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=

  12. [23]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=

  13. [37]

    International Conference on Learning Representations , year=

    Transformers Struggle to Learn to Search , author=. International Conference on Learning Representations , year=

  14. [38]

    Advances in Neural Information Processing Systems , year=

    Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. Advances in Neural Information Processing Systems , year=

  15. [49]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  16. [50]

    2023 , note =

    AMC 12 Problems and Solutions , author =. 2023 , note =

  17. [51]

    2025 , note =

    AIME Problems and Solutions , author =. 2025 , note =

  18. [53]

    Proceedings of the 42nd International Conference on Machine Learning , year=

    GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? , author=. Proceedings of the 42nd International Conference on Machine Learning , year=

  19. [54]

    2025 , eprint=

    seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs , author=. 2025 , eprint=

  20. [55]

    International Conference on Machine Learning , year=

    ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning , author=. International Conference on Machine Learning , year=

  21. [56]

    Advances in Neural Information Processing Systems , year=

    Are Language Models Efficient Reasoners? A Perspective from Logic Programming , author=. Advances in Neural Information Processing Systems , year=

  22. [57]

    Tools and Algorithms for the Construction and Analysis of Systems , pages =

    de Moura, Leonardo and Bj. Tools and Algorithms for the Construction and Analysis of Systems , pages =. 2008 , publisher =

  23. [62]

    Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

    Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

  24. [63]

    The Pitfalls of Next-Token Prediction , booktitle =

    Gregor Bachmann and Vaishnavh Nagarajan , editor =. The Pitfalls of Next-Token Prediction , booktitle =. 2024 , url =

  25. [64]

    The pitfalls of next-token prediction

    Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Re...

  26. [65]

    Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles

    Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, and Mingxuan Wang. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025

  27. [66]

    Z3 : An efficient SMT solver

    Leonardo de Moura and Nikolaj Bj rner. Z3 : An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337--340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24

  28. [67]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025 a

  29. [68]

    G1: Teaching llms to reason on graphs with reinforcement learning

    Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, and Yisen Wang. G1: Teaching llms to reason on graphs with reinforcement learning. arXiv preprint arXiv:2505.18499, 2025 b

  30. [69]

    Resyn: Autonomously scaling synthetic environments for reasoning models

    Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, and Huzefa Rangwala. Resyn: Autonomously scaling synthetic environments for reasoning models. arXiv preprint arXiv:2602.20117, 2026

  31. [70]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  32. [71]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  33. [72]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010...

  34. [73]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  35. [74]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  36. [75]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [76]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  38. [77]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  39. [78]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  40. [79]

    Dhillon, David Brandfonbrener, and Rishabh Agarwal

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025

  41. [80]

    Contextual integrity in LLM s via reasoning and reinforcement learning

    Guangchen Lan, Huseyin A Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher Brinton, and Robert Sim. Contextual integrity in LLM s via reasoning and reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  42. [81]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022

  43. [82]

    Zebralogic: On the scaling limits of llms for logical reasoning

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning. In International Conference on Machine Learning, 2025

  44. [83]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  45. [84]

    Saturn: Sat-based reinforcement learning to unleash llms reasoning

    Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, and Yihong Dong. Saturn: Sat-based reinforcement learning to unleash llms reasoning. arXiv preprint arXiv:2505.16368, 2025 a

  46. [85]

    Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

    Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641, 2025 b

  47. [86]

    R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

    Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, and Xunliang Cai. R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

  48. [87]

    Reft: Reasoning with reinforced fine-tuning, 2024

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024

  49. [88]

    h1: Bootstrapping llms to reason over longer horizons via reinforcement learning

    Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping llms to reason over longer horizons via reinforcement learning. arXiv preprint arXiv:2510.07312, 2025

  50. [89]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  51. [90]

    Are language models efficient reasoners? a perspective from logic programming

    Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, and Bernhard Sch \"o lkopf. Are language models efficient reasoners? a perspective from logic programming. In Advances in Neural Information Processing Systems, 2025

  52. [91]

    Reasoning models reason well, until they don't

    Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, and Abulhair Saparov. Reasoning models reason well, until they don't. arXiv preprint arXiv:2510.22371, 2025

  53. [92]

    seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025

    Mohammad Ramezanali, Mo Vazifeh, and Paolo Santi. seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025. URL https://arxiv.org/abs/2509.16866

  54. [93]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

  55. [94]

    Transformers struggle to learn to search

    Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=qFVVBzXxR2V

  56. [95]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  57. [96]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  58. [97]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  59. [98]

    Reasoninggym: Reasoningenvironmentsforreinforcementlearningwithverifiable rewards, 2025

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas K \"o pf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760, 2025

  60. [99]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300, 2025

  61. [100]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  62. [101]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  63. [102]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2023

  64. [103]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024

  65. [104]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  66. [105]

    Enhance reasoning for large language models in the game werewolf, 2024

    Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game werewolf. arXiv preprint arXiv:2402.02330, 2024

  67. [106]

    arXiv preprint arXiv:2502.14768 , year=

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  68. [107]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  69. [108]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  70. [109]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  71. [110]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  72. [111]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  73. [112]

    Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. Gsm-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity? In Proceedings of the 42nd International Conference on Machine Learning, 2025