Recognition: 2 theorem links
· Lean TheoremCan RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3
The pith
Reinforcement learning overcomes LLM long-horizon reasoning limits when training uses more expressive logic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RL training compute T follows T ∝ D^γ (R² > 0.99) with respect to reasoning depth D; the exponent γ rises monotonically from 1.04 in simple implication logic to 2.60 in first-order logic that includes and, or, not, and universal quantification. More expressive training produces larger downstream gains (up to +10.66 points) and more compute-efficient transfer on mathematics and general reasoning benchmarks. The same power-law relation appears across multiple RL algorithms, and curriculum ordering further improves scaling efficiency.
What carries the argument
ScaleLogic, a synthetic logical-reasoning framework that independently controls proof-planning depth (horizon) and the expressiveness of the underlying logic, ranging from implication-only to full first-order logic with conjunction, disjunction, negation, and universal quantification.
If this is right
- Training on logics with higher expressiveness produces both larger absolute gains and better compute efficiency on downstream tasks.
- The scaling exponent increases with logical expressiveness, so deeper reasoning becomes disproportionately more expensive in richer logics.
- Curriculum ordering during RL substantially reduces the compute needed to reach a given depth.
- The power-law relationship between compute and depth is reproducible across different RL algorithms.
Where Pith is reading between the lines
- Synthetic data generators for LLM reasoning training should emphasize logical expressiveness rather than simply increasing depth.
- If real-world tasks can be decomposed into similar depth-expressiveness coordinates, targeted RL curricula could be designed without requiring larger models.
- The observed scaling may generalize to other structured domains such as planning or program synthesis where horizon and rule complexity can be varied independently.
Load-bearing premise
That gains measured on the synthetic ScaleLogic tasks and their transfer to standard benchmarks serve as a faithful proxy for the long-horizon reasoning problems that appear in real applications.
What would settle it
Finding that models trained to high performance on ScaleLogic still show no meaningful improvement on a broad suite of authentic, long-chain reasoning problems outside the synthetic setting would falsify the central claim.
Figures
read the original abstract
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that these shortcomings are fundamental to the autoregressive transformer architecture. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ScaleLogic, a synthetic logical reasoning environment that independently varies proof depth D (horizon) and logical expressiveness (from implication-only to full first-order logic with and/or/not/forall). It reports that RL training compute T obeys a power law T ∝ D^γ with R² > 0.99, where the exponent γ rises monotonically from 1.04 to 2.60 as expressiveness increases; more expressive curricula produce larger downstream gains (up to +10.66 points) on math and general reasoning benchmarks and more efficient transfer. The authors conclude that long-horizon reasoning deficits in LLMs are not fundamental to the autoregressive architecture and can be mitigated by improved training methodology and data.
Significance. If the scaling and transfer results are robust, the work supplies a controlled testbed for dissecting how horizon and logical complexity interact with RL compute, showing that what the model is trained on (expressiveness) matters as much as how much it is trained. The use of multiple RL algorithms and curriculum variants, together with the clean power-law fits, offers a reproducible template for studying reasoning scaling that is currently rare in the LLM literature.
major comments (3)
- [Abstract] Abstract: the central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66 point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.
- [Abstract] Abstract: the reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.
- [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.
minor comments (1)
- [Abstract] The abstract states that the power-law relationship 'holds across multiple RL methods' but does not name the methods or report per-method γ values; adding this information would improve clarity without altering the central argument.
Simulated Author's Rebuttal
We thank the referee for the careful review and valuable suggestions. We have made revisions to the abstract to address the concerns about missing details. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture' rests on ScaleLogic results transferring to downstream benchmarks, yet the abstract provides no information on which specific benchmarks were used, the number of evaluation runs, or whether the +10.66 point gains survive multiple-comparison correction; without these details the transfer evidence cannot be evaluated as load-bearing support for the architectural conclusion.
Authors: We agree with this observation. The abstract has been revised to include information on the specific benchmarks used for downstream evaluation and to note that the gains are based on multiple independent runs. Details on statistical significance and any corrections are provided in the main text. revision: yes
-
Referee: [Abstract] Abstract: the reported R² > 0.99 for the power-law fits T ∝ D^γ is presented without stating the range of D values, number of data points per fit, or any data-exclusion criteria; because the monotonic rise in γ with expressiveness is the key quantitative result used to argue that training methodology overcomes horizon limits, the absence of these experimental controls makes the scaling claim difficult to assess.
Authors: We agree that the abstract should specify the experimental controls for the power-law fits. We have revised the abstract to include the range of D values used and the number of data points per fit. No data points were excluded in the fits, as described in the methods section of the manuscript. revision: yes
-
Referee: [Abstract] The manuscript's conclusion that improved RL and data suffice to address long-horizon deficits assumes ScaleLogic's closed, unambiguous entailment tasks are a faithful proxy; however, the framework lacks natural-language ambiguity, partial observability, and open-ended goal specification that characterize many real-world long-horizon problems. A direct test—evaluating whether the same γ scaling and transfer pattern appears on tasks that inject these features—would be required to substantiate the architectural claim.
Authors: We appreciate this point regarding the scope of ScaleLogic. The framework is intentionally synthetic to allow independent control over horizon and expressiveness, enabling the clean scaling analysis. We do not claim it is a complete proxy for all real-world long-horizon problems. However, the transfer results to standard benchmarks demonstrate practical benefits. We have added a paragraph in the discussion section acknowledging the limitations of the synthetic setting and suggesting future work on more open-ended tasks. We maintain that the results indicate the autoregressive architecture is not fundamentally limited for long-horizon reasoning when trained appropriately. revision: partial
Circularity Check
No circularity: central claims rest on empirical scaling and transfer measurements
full rationale
The paper's derivation consists of introducing the ScaleLogic framework for controlled experiments, fitting observed power-law relationships between training compute T and proof depth D (with reported R² > 0.99), measuring monotonic increases in the exponent γ with logical expressiveness, and reporting downstream benchmark gains from more expressive curricula. These are direct empirical observations across RL methods rather than any equation or parameter that reduces to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the conclusion that shortcomings are not fundamental follows interpretively from the transfer results without definitional collapse. The analysis is self-contained against external benchmarks and does not rely on fitted inputs renamed as predictions.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling exponent gamma =
1.04 to 2.60
axioms (1)
- domain assumption Synthetic logical tasks with controllable depth and expressiveness capture the essential bottlenecks of long-horizon reasoning in natural language and mathematics.
invented entities (1)
-
ScaleLogic framework
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We show that the RL training compute T follows a power law with respect to reasoning depth D (T ∝ D^γ, R² > 0.99), and that the scaling exponent γ increases monotonically with logical expressiveness, from 1.04 to 2.60.
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SCALELOGIC, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=
work page internal anchor Pith review arXiv
-
[2]
Unnatural instructions: Tuning language models with (almost) no human labor
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor , author=. arXiv preprint arXiv:2212.09689 , year=
-
[3]
Guangchen Lan and Huseyin A Inan and Sahar Abdelnabi and Janardhan Kulkarni and Lukas Wutschitz and Reza Shokri and Christopher Brinton and Robert Sim , booktitle=. Contextual Integrity in
-
[4]
Orca: Progressive learning from complex explanation traces of GPT-4
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 , author=. arXiv preprint arXiv:2306.02707 , year=
-
[5]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models , author=. arXiv preprint arXiv:2309.12284 , year=
work page internal anchor Pith review arXiv
-
[6]
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct , author=. arXiv preprint arXiv:2308.09583 , year=
-
[7]
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM: Empowering Large Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=
work page internal anchor Pith review arXiv
-
[8]
Agentin- struct: Toward generative teaching with agentic flows,
AgentInstruct: Toward Generative Teaching with Agentic Flows , author=. arXiv preprint arXiv:2407.03502 , year=
-
[14]
arXiv preprint arXiv:2305.13691 , year=
Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering , author=. arXiv preprint arXiv:2305.13691 , year=
-
[16]
International Conference on Learning Representations (ICLR) , year=
Learning from Synthetic Data Improves Multi-hop Reasoning , author=. International Conference on Learning Representations (ICLR) , year=
-
[22]
Large Language Models are Zero-Shot Reasoners
Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=
work page internal anchor Pith review arXiv
-
[23]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
International Conference on Learning Representations , year=
Transformers Struggle to Learn to Search , author=. International Conference on Learning Representations , year=
-
[38]
Advances in Neural Information Processing Systems , year=
Understanding Transformer Reasoning Capabilities via Graph Algorithms , author=. Advances in Neural Information Processing Systems , year=
-
[49]
Advances in Neural Information Processing Systems , volume=
Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=
- [50]
- [51]
-
[53]
Proceedings of the 42nd International Conference on Machine Learning , year=
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? , author=. Proceedings of the 42nd International Conference on Machine Learning , year=
-
[54]
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs , author=. 2025 , eprint=
work page 2025
-
[55]
International Conference on Machine Learning , year=
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning , author=. International Conference on Machine Learning , year=
-
[56]
Advances in Neural Information Processing Systems , year=
Are Language Models Efficient Reasoners? A Perspective from Logic Programming , author=. Advances in Neural Information Processing Systems , year=
-
[57]
Tools and Algorithms for the Construction and Analysis of Systems , pages =
de Moura, Leonardo and Bj. Tools and Algorithms for the Construction and Analysis of Systems , pages =. 2008 , publisher =
work page 2008
-
[62]
Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =
Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =
-
[63]
The Pitfalls of Next-Token Prediction , booktitle =
Gregor Bachmann and Vaishnavh Nagarajan , editor =. The Pitfalls of Next-Token Prediction , booktitle =. 2024 , url =
work page 2024
-
[64]
The pitfalls of next-token prediction
Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Re...
work page 2024
-
[65]
Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles
Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, and Mingxuan Wang. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025
-
[66]
Leonardo de Moura and Nikolaj Bj rner. Z3 : An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of Lecture Notes in Computer Science, pages 337--340. Springer, 2008. doi:10.1007/978-3-540-78800-3_24
-
[67]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025 a
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[68]
G1: Teaching llms to reason on graphs with reinforcement learning
Xiaojun Guo, Ang Li, Yifei Wang, Stefanie Jegelka, and Yisen Wang. G1: Teaching llms to reason on graphs with reinforcement learning. arXiv preprint arXiv:2505.18499, 2025 b
-
[69]
Resyn: Autonomously scaling synthetic environments for reasoning models
Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, and Huzefa Rangwala. Resyn: Autonomously scaling synthetic environments for reasoning models. arXiv preprint arXiv:2602.20117, 2026
-
[70]
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[71]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[72]
Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010...
work page internal anchor Pith review arXiv 2010
-
[73]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[74]
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025
work page internal anchor Pith review arXiv 2025
-
[75]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[76]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[77]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[78]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[79]
Dhillon, David Brandfonbrener, and Rishabh Agarwal
Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025
-
[80]
Contextual integrity in LLM s via reasoning and reinforcement learning
Guangchen Lan, Huseyin A Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher Brinton, and Robert Sim. Contextual integrity in LLM s via reasoning and reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[81]
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022
work page internal anchor Pith review arXiv 2022
-
[82]
Zebralogic: On the scaling limits of llms for logical reasoning
Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning. In International Conference on Machine Learning, 2025
work page 2025
-
[83]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[84]
Saturn: Sat-based reinforcement learning to unleash llms reasoning
Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, and Yihong Dong. Saturn: Sat-based reinforcement learning to unleash llms reasoning. arXiv preprint arXiv:2505.16368, 2025 a
-
[85]
Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond
Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, and Junxian He. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond. arXiv preprint arXiv:2505.19641, 2025 b
-
[86]
Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, and Xunliang Cai. R-horizon: How far can your large reasoning model really go in breadth and depth? arXiv preprint arXiv:2510.08189, 2025
-
[87]
Reft: Reasoning with reinforced fine-tuning, 2024
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967, 2024
-
[88]
h1: Bootstrapping llms to reason over longer horizons via reinforcement learning
Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, and Charles London. h1: Bootstrapping llms to reason over longer horizons via reinforcement learning. arXiv preprint arXiv:2510.07312, 2025
-
[89]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[90]
Are language models efficient reasoners? a perspective from logic programming
Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, and Bernhard Sch \"o lkopf. Are language models efficient reasoners? a perspective from logic programming. In Advances in Neural Information Processing Systems, 2025
work page 2025
-
[91]
Reasoning models reason well, until they don't
Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, and Abulhair Saparov. Reasoning models reason well, until they don't. arXiv preprint arXiv:2510.22371, 2025
-
[92]
seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025
Mohammad Ramezanali, Mo Vazifeh, and Paolo Santi. seqbench: A tunable benchmark to quantify sequential reasoning limits of llms, 2025. URL https://arxiv.org/abs/2509.16866
-
[93]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[94]
Transformers struggle to learn to search
Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=qFVVBzXxR2V
work page 2025
-
[95]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[96]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[97]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[98]
Reasoninggym: Reasoningenvironmentsforreinforcementlearningwithverifiable rewards, 2025
Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, and Andreas K \"o pf. Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760, 2025
-
[99]
Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of llm reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[100]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[101]
Qwen3.5: Accelerating productivity with native multimodal agents, February 2026
Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[102]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[103]
Mmlu-pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024
work page 2024
-
[104]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[105]
Enhance reasoning for large language models in the game werewolf, 2024
Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game werewolf. arXiv preprint arXiv:2402.02330, 2024
-
[106]
arXiv preprint arXiv:2502.14768 , year=
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025
-
[107]
An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[108]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[109]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[110]
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025
work page internal anchor Pith review arXiv 2025
-
[111]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[112]
Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. Gsm-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity? In Proceedings of the 42nd International Conference on Machine Learning, 2025
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.