pith. machine review for the scientific record.

arxiv: 2605.06638 · v2 · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL
keywords reinforcement learning · large language models · long-horizon reasoning · logical expressiveness · scaling laws · synthetic benchmarks · transfer learning

The pith

Reinforcement learning overcomes LLM long-horizon reasoning limits when training uses more expressive logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built ScaleLogic, a controlled synthetic environment in which the required reasoning depth and the expressiveness of the logical rules can be dialed up independently. Training LLMs with RL on these tasks produces a clean power-law relationship between training compute and reasoning depth, with the scaling exponent worsening as the logic gains conjunction, disjunction, negation, and quantifiers. Models trained under the more expressive regimes transfer more effectively to downstream math and reasoning benchmarks, gaining up to ten points and using compute more efficiently. The results indicate that current shortcomings in chaining many reasoning steps are not fixed properties of the transformer architecture but respond to choices in training data and procedure.

Core claim

RL training compute T follows T ∝ D^γ (R² > 0.99) with respect to reasoning depth D; the exponent γ rises monotonically from 1.04 in simple implication logic to 2.60 in first-order logic that includes and, or, not, and universal quantification. More expressive training produces larger downstream gains (up to +10.66 points) and more compute-efficient transfer on mathematics and general reasoning benchmarks. The same power-law relation appears across multiple RL algorithms, and curriculum ordering further improves scaling efficiency.
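As a concrete reading of the fit, the sketch below estimates γ by ordinary least squares on log-transformed data, the standard way to fit T ∝ D^γ. The (D, T) values are invented for illustration and are not the paper's measurements; the authors' exact fitting protocol may differ.

```python
# Minimal sketch: estimating gamma in T ∝ D^gamma by least squares in log-log space.
# The (D, T) pairs are invented for illustration; they are not the paper's data.
import numpy as np

D = np.array([2, 4, 6, 8, 10, 12], dtype=float)             # reasoning depths (hypothetical)
T = np.array([55, 210, 520, 980, 1600, 2400], dtype=float)  # steps to 90% accuracy (hypothetical)

logD, logT = np.log(D), np.log(T)
gamma, log_a = np.polyfit(logD, logT, 1)      # slope = gamma, intercept = log a

residuals = logT - (gamma * logD + log_a)
r2 = 1.0 - residuals.var() / logT.var()       # R^2 of the log-log fit

print(f"gamma = {gamma:.2f}, a = {np.exp(log_a):.1f}, R^2 = {r2:.4f}")
```

On data of this shape, the slope of the log-log regression is the exponent the paper reports per expressiveness setting.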

What carries the argument

ScaleLogic, a synthetic logical-reasoning framework that independently controls proof-planning depth (horizon) and the expressiveness of the underlying logic, ranging from implication-only to full first-order logic with conjunction, disjunction, negation, and universal quantification.
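For intuition about the two difficulty knobs, here is a minimal, hypothetical generator for the implication-only end of the spectrum: it chains D implications from a starting fact and offers B candidate conclusions, exactly one of which is provable. The names and encoding are illustrative assumptions; the paper's actual generator builds full candidate proof trees, corrupts axioms to make distractors unprovable, and supports the richer logics.

```python
# Hypothetical sketch of an implication-only, ScaleLogic-style problem instance.
# Not the authors' generator: names, encoding, and distractor scheme are illustrative.
import random

def make_problem(depth: int, n_candidates: int, seed: int = 0):
    rng = random.Random(seed)
    atoms = [f"P{i}" for i in range(depth + n_candidates)]
    rng.shuffle(atoms)

    fact = atoms[0]                                           # the single starting axiom
    rules = [(atoms[i], atoms[i + 1]) for i in range(depth)]  # chain: atom_i -> atom_{i+1}
    provable = atoms[depth]                                   # reachable only via all D steps

    # Distractor conclusions: atoms that appear in no rule, hence unreachable.
    distractors = atoms[depth + 1 : depth + n_candidates]
    candidates = [provable] + distractors
    rng.shuffle(candidates)
    return fact, rules, candidates, provable

fact, rules, candidates, answer = make_problem(depth=4, n_candidates=3)
print("fact:", fact)
print("rules:", ", ".join(f"{a} -> {b}" for a, b in rules))
print("candidates:", candidates, "| provable:", answer)
```

Raising `depth` lengthens the horizon while the rule format stays fixed; the more expressive settings would instead populate the rules with conjunction, disjunction, negation, and universal quantification.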

If this is right

  • Training on logics with higher expressiveness produces both larger absolute gains and better compute efficiency on downstream tasks.
  • The scaling exponent increases with logical expressiveness, so deeper reasoning becomes disproportionately more expensive in richer logics (see the worked example after this list).
  • Curriculum ordering during RL substantially reduces the compute needed to reach a given depth.
  • The power-law relationship between compute and depth is reproducible across different RL algorithms.
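To make the second bullet concrete, the power law fixes the marginal price of depth: doubling D multiplies the required compute by 2^γ, independent of the starting depth. Using the paper's reported endpoint exponents:

$$\frac{T(2D)}{T(D)} = 2^{\gamma}, \qquad 2^{1.04} \approx 2.1 \ \text{(Implication-only)}, \qquad 2^{2.60} \approx 6.1 \ \text{(+ Quantification)}.$$

A depth doubling that roughly doubles training compute in the simplest logic costs about six times as much in the most expressive setting.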

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Synthetic data generators for LLM reasoning training should emphasize logical expressiveness rather than simply increasing depth.
  • If real-world tasks can be decomposed into similar depth-expressiveness coordinates, targeted RL curricula could be designed without requiring larger models.
  • The observed scaling may generalize to other structured domains such as planning or program synthesis where horizon and rule complexity can be varied independently.

Load-bearing premise

That gains measured on the synthetic ScaleLogic tasks and their transfer to standard benchmarks serve as a faithful proxy for the long-horizon reasoning problems that appear in real applications.

What would settle it

Finding that models trained to high performance on ScaleLogic still show no meaningful improvement on a broad suite of authentic, long-chain reasoning problems outside the synthetic setting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.06638 by Abulhair Saparov, Guangchen Lan, Guanwen Qiu, Sipeng Zhang, Tianle Wang, Xinpeng Wei, Zhaoyang Wang.

Figure 1: Overview of ScaleLogic. Each problem has B candidate proof trees, exactly one of which has a provable conclusion; the others are made unprovable by corrupting one axiom. The depth D controls proof depth. Left: implication-only reasoning. Right: the most expressive logic setting (referred to as + Quantification in Section 3.2) combines conjunction, disjunction, negation, and universal quantification.

Figure 2: Training cost scales as a power law with reasoning depth, with exponent γ governed by logical expressiveness. (a) Scatter points show the training steps T required to reach the 90% accuracy threshold as a function of reasoning depth D; solid lines show power-law fits T ∝ D^γ. (b) The fitted exponent γ increases monotonically as expressiveness increases from Implication-only (γ = 1.04) to + Quantification (γ …

Figure 3: Downstream transfer from synthetic reasoning training. (a) All settings outperform the base model, with richer logical settings producing larger and more sustained gains. (b) Average downstream performance across logical settings under fixed depth (D = 12) and fixed compute (∼100 steps). Both controls exhibit the same monotone trend: more expressive training settings yield stronger downstream performance.

Figure 4: Effect of training distribution and RL algorithm on scaling efficiency in the + Conjunction setting. Each curve aggregates three independent seeds; shading denotes ±1 standard deviation in log-space across seeds. (a) Curriculum training yields the lowest exponent (γ = 1.33), whereas difficult-only training yields the highest (γ = 2.36). (b) All RL algorithms follow power-law scaling (R² > 0.99), with expo…

Figure 5: Out-of-distribution generalization across reasoning depths. (a) Increasing the training depth consistently extends the range of solvable test depths. (b) OOD generalization remains bounded: even the models trained at the largest depths fall to random at D_test/D_train ≈ 3.

Figure 6: Power-law (orange) vs. exponential (blue) fits for each expressiveness setting.

Figure 7: Power-law scaling under the 85% accuracy threshold (cf. main-text …

Figure 8: Power-law fits T(D) = a · D^γ under four alternative compute measures, complementing …

Figure 9: Effect of candidate count B at fixed depth D = 8 under the + Quantification setting. (a) Training steps to 90% accuracy follow a power law in B (γ_B = 1.41, R² = 0.984, ΔAIC = +7.0 vs. exponential). (b) Average downstream performance across the eight reasoning benchmarks of Section 4.3, plotted against candidate count B. Gains saturate quickly: performance rises from 52.5% at B = 2 to 55.7% at B = 4 (+3.2 …

Figure 10: Uniform vs. curriculum training under the most expressive …

Figure 11: Training trajectories on + Conjunction under three data distributions. Each row: response length (left), entropy (middle), validation accuracy (right); the legend (training depth D) is shared across the three sub-panels. Axis ranges may differ across rows to improve readability.

Figure 12: Cross-scale replication on Qwen3-8B. (a) Training steps to convergence vs. reasoning depth across five expressiveness levels on log-log axes; solid lines show power-law fits T ∝ D^γ. (b) The fitted γ increases monotonically as expressiveness increases from Implication-only (γ = 0.99) to + Quantification (γ = 2.53), mirroring the 4B picture.

Figure 13: Training trajectories for all five settings. Each row: response length (left), entropy (middle), validation …

Figure 14: Number of correct answers out of 30 for frontier LLMs on …

Figure 15: MATH-500 #80: full reasoning trace of our model (top, …
Original abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that these shortcomings are fundamental to the autoregressive transformer architecture. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
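The abstract's power-law claim is only as strong as its comparison against the natural rival, exponential growth in depth, and the appendix figures report exactly that contrast (power-law vs. exponential fits with ΔAIC values). A minimal version of such a model comparison might look like the sketch below; the data are invented, and scoring both fits with a Gaussian AIC on log residuals is an assumption about the procedure, not a reproduction of the paper's.

```python
# Illustrative model comparison: power law T = a * D**g vs. exponential T = a * exp(b * D).
# Invented data; the AIC convention (Gaussian likelihood on log residuals) is an assumption.
import numpy as np

D = np.array([2, 4, 6, 8, 10, 12], dtype=float)
T = np.array([55, 210, 520, 980, 1600, 2400], dtype=float)
logT = np.log(T)

def aic_linear(x, y, k=3):                  # parameters: slope, intercept, noise variance
    slope, intercept = np.polyfit(x, y, 1)
    rss = np.sum((y - (slope * x + intercept)) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * k      # Gaussian log-likelihood up to constants

aic_power = aic_linear(np.log(D), logT)     # power law: log T is linear in log D
aic_exp = aic_linear(D, logT)               # exponential: log T is linear in D
print(f"delta AIC (exp - power) = {aic_exp - aic_power:+.1f}; positive favors the power law")
```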

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ScaleLogic, a synthetic logical reasoning environment that independently varies proof depth D (horizon) and logical expressiveness (from implication-only to full first-order logic with and/or/not/forall). It reports that RL training compute T obeys a power law T ∝ D^γ with R² > 0.99, where the exponent γ rises monotonically from 1.04 to 2.60 as expressiveness increases; more expressive curricula produce larger downstream gains (up to +10.66 points) on math and general reasoning benchmarks and more efficient transfer. The authors conclude that long-horizon reasoning deficits in LLMs are not fundamental to the autoregressive architecture and can be mitigated by improved training methodology and data.

Significance. If the scaling and transfer results are robust, the work supplies a controlled testbed for dissecting how horizon and logical complexity interact with RL compute, showing that what the model is trained on (expressiveness) matters as much as how much it is trained. The use of multiple RL algorithms and curriculum variants, together with the clean power-law fits, offers a reproducible template for studying reasoning scaling that is currently rare in the LLM literature.

major comments (3)
  1. [Abstract] The central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66-point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.
  2. [Abstract] The reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, the number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.
  3. [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.
minor comments (1)
  1. [Abstract] The abstract states that the power-law relationship 'holds across multiple RL methods' but does not name the methods or report per-method γ values; adding this information would improve clarity without altering the central argument.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and valuable suggestions. We have made revisions to the abstract to address the concerns about missing details. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66-point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.

    Authors: We agree with this observation. The abstract has been revised to include information on the specific benchmarks used for downstream evaluation and to note that the gains are based on multiple independent runs. Details on statistical significance and any corrections are provided in the main text. revision: yes

  2. Referee: [Abstract] The reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, the number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.

    Authors: We agree that the abstract should specify the experimental controls for the power-law fits. We have revised the abstract to include the range of D values used and the number of data points per fit. No data points were excluded in the fits, as described in the methods section of the manuscript. revision: yes

  3. Referee: [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.

    Authors: We appreciate this point regarding the scope of ScaleLogic. The framework is intentionally synthetic to allow independent control over horizon and expressiveness, enabling the clean scaling analysis. We do not claim it is a complete proxy for all real-world long-horizon problems. However, the transfer results to standard benchmarks demonstrate practical benefits. We have added a paragraph in the discussion section acknowledging the limitations of the synthetic setting and suggesting future work on more open-ended tasks. We maintain that the results indicate the autoregressive architecture is not fundamentally limited for long-horizon reasoning when trained appropriately. revision: partial

Circularity Check

0 steps flagged

No circularity: central claims rest on empirical scaling and transfer measurements

Full rationale

The paper's derivation consists of introducing the ScaleLogic framework for controlled experiments, fitting observed power-law relationships between training compute T and proof depth D (with reported R² > 0.99), measuring monotonic increases in the exponent γ with logical expressiveness, and reporting downstream benchmark gains from more expressive curricula. These are direct empirical observations across RL methods rather than any equation or parameter that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the conclusion that shortcomings are not fundamental follows interpretively from the transfer results without definitional collapse. The analysis is self-contained against external benchmarks and does not rely on fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the assumption that ScaleLogic tasks are representative of real reasoning difficulty and that the observed power-law relationship generalizes beyond the tested RL methods.

free parameters (1)
  • scaling exponent γ = 1.04 to 2.60
    Fitted separately for each logical-expressiveness level to the observed compute-versus-depth data.
axioms (1)
  • domain assumption: Synthetic logical tasks with controllable depth and expressiveness capture the essential bottlenecks of long-horizon reasoning in natural language and mathematics.
    Invoked when generalizing training results to downstream benchmarks.
invented entities (1)
  • ScaleLogic framework (no independent evidence)
    purpose: Provide independent control over proof horizon and logical expressiveness for RL training.
    New synthetic environment introduced to enable the scaling experiments.

pith-pipeline@v0.9.0 · 5636 in / 1343 out tokens · 32909 ms · 2026-05-12T03:05:35.344614+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 31 internal anchors

  1. [1]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

  2. [2]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor , author=. arXiv preprint arXiv:2212.09689 , year=

  3. [3]

    Contextual Integrity in

    Guangchen Lan and Huseyin A Inan and Sahar Abdelnabi and Janardhan Kulkarni and Lukas Wutschitz and Reza Shokri and Christopher Brinton and Robert Sim , booktitle=. Contextual Integrity in

  4. [4]

    Orca: Progressive learning from complex explanation traces of GPT-4

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4 , author=. arXiv preprint arXiv:2306.02707 , year=

  5. [5]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author=. arXiv preprint arXiv:2309.12284 , year=

  6. [6]

    Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author=. arXiv preprint arXiv:2308.09583 , year=

  7. [7]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

  8. [8]

    Agentin- struct: Toward generative teaching with agentic flows,

    AgentInstruct: Toward Generative Teaching with Agentic Flows , author=. arXiv preprint arXiv:2407.03502 , year=

  9. [14]

    arXiv preprint arXiv:2305.13691 , year=

    Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering , author=. arXiv preprint arXiv:2305.13691 , year=

  10. [16]

    International Conference on Learning Representations (ICLR) , year=

    Learning from Synthetic Data Improves Multi-hop Reasoning , author=. International Conference on Learning Representations (ICLR) , year=

  11. [22]

    Large Language Models are Zero-Shot Reasoners

    Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=

  12. [23]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=

  13. [37]

    International Conference on Learning Representations , year=

    Transformers Struggle to Learn to Search , author=. International Conference on Learning Representations , year=

  14. [38]

    Advances in Neural Information Processing Systems , year=

    Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. Advances in Neural Information Processing Systems , year=

  15. [49]

    Advances in Neural Information Processing Systems , volume=

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=

  16. [50]

    2023 , note =

    AMC 12 Problems and Solutions , author =. 2023 , note =

  17. [51]

    2025 , note =

    AIME Problems and Solutions , author =. 2025 , note =

  18. [53]

    Proceedings of the 42nd International Conference on Machine Learning , year=

    GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? , author=. Proceedings of the 42nd International Conference on Machine Learning , year=

  19. [54]

    2025 , eprint=

    seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs , author=. 2025 , eprint=

  20. [55]

    International Conference on Machine Learning , year=

    ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning , author=. International Conference on Machine Learning , year=

  21. [56]

    Advances in Neural Information Processing Systems , year=

    Are Language Models Efficient Reasoners? A Perspective from Logic Programming , author=. Advances in Neural Information Processing Systems , year=

  22. [57]

    Tools and Algorithms for the Construction and Analysis of Systems , pages =

    de Moura, Leonardo and Bj. Tools and Algorithms for the Construction and Analysis of Systems , pages =. 2008 , publisher =

  23. [62]

    Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

    Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =

  24. [63]

    The Pitfalls of Next-Token Prediction , booktitle =

    Gregor Bachmann and Vaishnavh Nagarajan , editor =. The Pitfalls of Next-Token Prediction , booktitle =. 2024 , url =

  25. [64]

    The pitfalls of next-token prediction

    Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Re...

  26. [65]

    Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles

    Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, and Mingxuan Wang. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025

  27. [66]

    Z3 : An efficient SMT solver

    Leonardo de Moura and Nikolaj Bj rner. Z3 : An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337--340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24

  28. [67]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025 a

  29. [68]

    G1: Teaching llms to reason on graphs with reinforcement learning

    Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, and Yisen Wang. G1: Teaching llms to reason on graphs with reinforcement learning. arXiv preprint arXiv:2505.18499, 2025 b

  30. [69]

    Resyn: Autonomously scaling synthetic environments for reasoning models

    Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, and Huzefa Rangwala. Resyn: Autonomously scaling synthetic environments for reasoning models. arXiv preprint arXiv:2602.20117, 2026

  31. [70]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  32. [71]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  33. [72]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010...

  34. [73]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  35. [74]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  36. [75]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  37. [76]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  38. [77]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  39. [78]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  40. [79]

    Dhillon, David Brandfonbrener, and Rishabh Agarwal

    Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025

  41. [80]

    Contextual integrity in LLM s via reasoning and reinforcement learning

    Guangchen Lan, Huseyin A Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher Brinton, and Robert Sim. Contextual integrity in LLM s via reasoning and reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  42. [81]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022

  43. [82]

    Zebralogic: On the scaling limits of llms for logical reasoning

    Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning. In International Conference on Machine Learning, 2025

  44. [83]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  45. [84]

    Saturn: Sat-based reinforcement learning to unleash llms reasoning

    Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, and Yihong Dong. Saturn: Sat-based reinforcement learning to unleash llms reasoning. arXiv preprint arXiv:2505.16368, 2025 a

  46. [85]

    Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond

    Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641, 2025 b

  47. [86]

    R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

    Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, and Xunliang Cai. R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025

  48. [87]

    Reft: Reasoning with reinforced fine-tuning, 2024

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024

  49. [88]

    h1: Bootstrapping llms to reason over longer horizons via reinforcement learning

    Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping llms to reason over longer horizons via reinforcement learning. arXiv preprint arXiv:2510.07312, 2025

  50. [89]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  51. [90]

    Are language models efficient reasoners? a perspective from logic programming

    Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, and Bernhard Sch \"o lkopf. Are language models efficient reasoners? a perspective from logic programming. In Advances in Neural Information Processing Systems, 2025

  52. [91]

    Reasoning models reason well, until they don't

    Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, and Abulhair Saparov. Reasoning models reason well, until they don't. arXiv preprint arXiv:2510.22371, 2025

  53. [92]

    seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025

    Mohammad Ramezanali, Mo Vazifeh, and Paolo Santi. seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025. URL https://arxiv.org/abs/2509.16866

  54. [93]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

  55. [94]

    Transformers struggle to learn to search

    Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=qFVVBzXxR2V

  56. [95]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  57. [96]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  58. [97]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  59. [98]

    Reasoninggym: Reasoningenvironmentsforreinforcementlearningwithverifiable rewards, 2025

    Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas K \"o pf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760, 2025

  60. [99]

    Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

    Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300, 2025

  61. [100]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  62. [101]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  63. [102]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2023

  64. [103]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024

  65. [104]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  66. [105]

    Enhance reasoning for large language models in the game werewolf, 2024

    Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game werewolf. arXiv preprint arXiv:2402.02330, 2024

  67. [106]

    arXiv preprint arXiv:2502.14768 , year=

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  68. [107]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  69. [108]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  70. [109]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  71. [110]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  72. [111]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  73. [112]

    Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. Gsm-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity? In Proceedings of the 42nd International Conference on Machine Learning, 2025