pith. machine review for the scientific record.

arxiv: 2509.23629 · v3 · submitted 2025-09-28 · 💻 cs.AI · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.soc-ph

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

Pith reviewed 2026-05-18 12:39 UTC · model grok-4.3

classification 💻 cs.AI · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.soc-ph
keywords reinforcement learning · verifiable rewards · large language models · slow thinking · concept network · inverse trees · annealed RLVR · training dynamics

The pith

RLVR causes LLMs to develop slow thinking by freezing concept networks into inverse trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper offers a statistical-physics account of how reinforcement learning with verifiable rewards lets large language models learn multi-step reasoning from only final-answer feedback. It argues that any autoregressive model must compress its vast prefix space into a Markov network of predictive states, called the Concept Network. On this network, RLVR operates through two processes: merging paths that can be combined and creating competition between those that cannot. These drive the network to nucleate, grow, and freeze into directed inverse trees with multiple inputs leading to single outputs. This view matches the training curves of a 1.5-billion-parameter model and leads to a new training method, Annealed-RLVR, that inserts a short supervised fine-tuning step at the height of frustration and beats standard RLVR, particularly when many answers are sampled.

Core claim

The central discovery is that slow thinking in LLMs emerges as the Concept Network freezes into multi-input, single-output directed inverse trees under RLVR dynamics. Path merging of compatible reasoning steps and frustrated competition among incompatible ones govern the evolution through nucleation, growth, and freezing stages. This structural picture reproduces the observed training dynamics of a 1.5B parameter LLM and generates specific predictions about lengthening reasoning chains, the timing dependence of supervised fine-tuning effects, and policy collapse under high frustration. The resulting Annealed-RLVR procedure, which applies brief SFT exactly at maximum frustration, delivers in-

What carries the argument

The Concept Network (CoNet), the Markov network of predictive states into which the autoregressive model compresses its prefix space, on which slow thinking acts as a random walk that RLVR then organizes into inverse trees via merging and competition.
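
To make the random-walk picture concrete, here is a minimal Python sketch of a walk on a small Markov network, where each node stands for a predictive state and each edge carries a transition probability. This is a toy illustration, not the paper's CoNet model: the node names, the probabilities, and the `walk` helper are all hypothetical.

```python
import random

# Hypothetical toy network (an assumption for illustration only).
# "Q" is a question node, "A" an answer node, "s1".."s3" intermediate states.
TRANSITIONS = {
    "Q":  [("s1", 0.5), ("s2", 0.5)],
    "s1": [("s3", 0.7), ("s2", 0.3)],
    "s2": [("s3", 1.0)],
    "s3": [("A", 1.0)],
}

def walk(start, terminal, transitions, rng, max_steps=100):
    """Sample one path from `start` until `terminal` is reached or max_steps is hit."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == terminal:
            break
        next_states, probs = zip(*transitions[state])
        # The next state depends only on the current state (Markov property).
        state = rng.choices(next_states, weights=probs, k=1)[0]
        path.append(state)
    return path

rng = random.Random(0)
path = walk("Q", "A", TRANSITIONS, rng)
print(path)
```

In this toy, a "reasoning chain" is simply the sampled path from the question node to the answer node; training would correspond to changing the edge probabilities.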

If this is right

  • Reasoning chains lengthen geometrically due to the sparse topology of the frozen inverse trees.
  • Applying supervised fine-tuning after the trees have frozen causes catastrophic forgetting through rupture of bridge nodes.
  • High frustration during training drives the policy to collapse.
  • Annealed-RLVR improves benchmark scores over standard RLVR on both in-distribution and out-of-distribution tasks, with gains largest at high sampling budgets where standard methods fail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The timing of interventions relative to the freezing transition may generalize to other training regimes that aim to preserve or enhance reasoning structures.
  • If the inverse-tree topology holds, then methods to directly encourage or measure such structures could accelerate the emergence of reliable multi-step reasoning.
  • Connections to physical systems with similar nucleation and freezing dynamics might offer new ways to analyze or control LLM training trajectories.

Load-bearing premise

An autoregressive model's finite capacity must compress its exponentially large prefix space into a Markov network of predictive states on which reasoning unfolds as a random walk.
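
One common way to formalize "predictive state" is as an equivalence class of prefixes that induce the same distribution over continuations. The sketch below is written under that assumption; the symbols $p_\theta$, $s$, and $T$ are introduced here for illustration, and the paper's own definition may differ.

```latex
% Assumption: p_\theta(y | x) is the model's probability of continuation y given prefix x.
% Two prefixes are equivalent when they predict the same future:
\[
  x \sim x' \iff p_\theta(y \mid x) = p_\theta(y \mid x') \ \text{for every continuation } y,
  \qquad s = [x].
\]
% Induced transition between predictive states. It is well defined because the
% class of the extended prefix xa depends only on the class [x] and the token a:
\[
  T(s' \mid s) = \sum_{a \,:\, [xa] = s'} p_\theta(a \mid x), \qquad x \in s.
\]
```

Under this reading, the premise amounts to the claim that a finite-capacity model can only distinguish a manageable number of classes $s$, so generation is a walk governed by $T$.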

What would settle it

A direct test would be to check whether the internal concept activations or state transitions during RLVR training exhibit the predicted sequence of nucleation, growth, and freezing into inverse-tree structures; absence of such patterns or failure of Annealed-RLVR to outperform at high sampling budgets would challenge the account.
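
As a rough illustration of what "checking for inverse-tree structure" could mean on an extracted transition graph, the Python sketch below tests whether a directed graph is a multi-input, single-output inverse tree: every node has at most one outgoing edge, exactly one node (the output) has none, and every node eventually reaches that output. The function name and the example graphs are hypothetical, and how one would extract such a graph from an LLM is left open here.

```python
def is_inverse_tree(edges):
    """Return True if the directed graph given by (src, dst) pairs is an inverse tree.

    Conditions checked: out-degree <= 1 for every node, exactly one node
    with out-degree 0 (the single output), and no cycles.
    """
    successor = {}
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if src in successor:          # a second outgoing edge breaks the tree
            return False
        successor[src] = dst
    roots = [n for n in nodes if n not in successor]
    if len(roots) != 1:               # need exactly one output node
        return False
    root = roots[0]
    for start in nodes:
        seen = set()
        node = start
        while node != root:
            if node in seen:          # walked in a circle: a cycle exists
                return False
            seen.add(node)
            node = successor[node]
    return True

# Two questions merge into one intermediate step, then reach one answer.
merged = [("q1", "m"), ("q2", "m"), ("m", "a")]
# One question branches into two competing continuations.
branching = [("q", "m1"), ("q", "m2"), ("m1", "a"), ("m2", "a")]
print(is_inverse_tree(merged), is_inverse_tree(branching))
```

The first example is the merged, frozen shape the account predicts; the second still contains a branching node, which in the paper's language would correspond to unresolved competition.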

Figures

Figures reproduced from arXiv: 2509.23629 by Kun Chen, Linfeng Zhang, Pan Zhang, Sihan Hu, Xiansheng Cai, Youjin Deng, Yuan Huang, Zhiyuan Yao.

Figure 1: CoNet Reproduces Core LLM Training Dynamics. The minimal CoNet model (a) reproduces the two core empirical signatures of RLVR training observed in the DeepScaleR-1.5B LLM (b). These signatures are: (i) a two-stage reward dynamic consisting of a fast-learning stage followed by a slow-learning phase (top panels), and (ii) a non-monotonic, V-shaped evolution of the correct response length (bottom panels). Thi…
Figure 2: Visualizing the Structural Evolution from “Skill Islands” to the Concept Web. These network snapshots provide a direct, qualitative narrative of the concept web’s formation, corresponding to the different phases of training. In each subfigure, green, red, and blue dots indicate the question, answer, and intermediate nodes of the CoNet, respectively. The color of each directed edge represents the transitio…
Figure 3: A Sparse-Web Structure Necessitates Longer Reasoning Chains. This figure links the emergent sparse topology of the concept web to the observed increase in response length. (b, c) The color and marker scheme is identical to that used in …
Figure 4: The Fragility of a Sparse Web: Catastrophic Forgetting and Fast Recovery. This figure demonstrates a key prediction of our sparse-web hypothesis: that the concept web is fragile, relying on critical bridge-like connections. (c-e) The color and shape conventions follow those of …
Figure 5: Microscopic Mechanisms of RLVR: Forgetting by Frustration, Learning by Phase Transition. The learning trajectories for individual problems in CoNet (a) and the 1.5B LLM (b) reveal a fundamental duality. (i) Frustration-Induced Forgetting: At the onset of slow learning [see inset in (a)], the intense competition for connections on a sparse web manifests as volatile, non-monotonic accuracy curves, where some …
Figure 6: Annealed-RLVR Outperforms Standard RLVR. This figure compares the best@k accuracy curves of our Annealed-RLVR ("Annealed") with the standard RLVR baseline ("RLVR"). The evaluation is performed on (a) an in-distribution set (512 randomly selected training problems) and two out-of-distribution (OOD) sets: (b) the Minerva dataset and (c) the AIME 2024/2025 datasets. The results show that Annealed-RLVR consis…
Figure 7: Illustration of the LLM reasoning process. A chain-of-thought is composed of stable, low-entropy …
Figure 8: From Intractable LLM Dynamics to a Minimal CoNet Model. The reasoning process in an LLM (left) can be viewed as a traversal on a high-dimensional latent graph, where path probabilities are determined by the complex, autoregressive Transformer. Directly analyzing this graph’s evolution is intractable. CoNet (right) provides a minimal abstraction by replacing the intractable latent graph with a fixed K-regul…
Figure 9: The Formation of the Concept Web in CoNet. This figure illustrates the evolution of network clusters during training, showing a structural reorganization from isolated skills to a unified conceptual web. In the initial phase, the Cluster Number (orange curve) spikes, indicating the rapid discovery of numerous disconnected “skill islands”. The peak marks a critical structural juncture. Following this peak, …
Figure 10: Per-Problem Critical Dynamics During the Slow-Learning Stage. These plots reveal the microscopic learning dynamics for a representative subset of problems that successfully learn during the slow-learning stage, providing evidence for localized phase transitions. (Left) Accuracy trajectories for these problems. All problems shown exhibit a sharp, sigmoidal-like increase in accuracy rate in the training pro…
Figure 11: Structural Impact of Annealed-RLVR on the Concept Web. Evolution of the concept web for standard RLVR (blue) versus Annealed-RLVR (orange). The SFT intervention at step 50 (dashed line) induces an immediate drop in the Annealed model’s cluster size. Subsequently, the Annealed model recovers and surpasses the standard RLVR baseline, which exhibits slower growth, ultimately forming a larger final concept web. T…
Figure 12: Validation of Annealed-RLVR in CoNet. (Left) The histogram of per-problem accuracy shows that the annealed model (SFT-CKPT750) significantly reduces the number of completely unsolved problems (Accuracy = 0.0) and increases the number of mastered problems (Accuracy = 1.0) compared to the standard model (CKPT800). (Right) The line plot of training dynamics shows that immediately after the SFT intervention …
Figure 13: Mechanism of Annealed-RLVR: Accuracy Distributions. This figure complements the best@k curves in the main text by revealing the underlying mechanism of performance improvement. It shows the per-problem accuracy histograms for the checkpoints of Annealed-RLVR ("Annealed") versus the standard RLVR ("RLVR") baseline, evaluated on (a) the in-distribution set (Random 512), (b) the OOD Minerva dataset, and (c)…
Original abstract

Reinforcement learning with verifiable rewards (RLVR) enables large language models to acquire slow, multi-step reasoning from sparse final-answer signals. We provide a statistical-physics picture of this emergence. We show that an autoregressive model's finite capacity forces it to compress its exponentially large prefix space into a Markov network of predictive states, on which slow thinking unfolds as a random walk -- the Concept Network (CoNet) picture. Within CoNet, RLVR dynamics are governed by two mechanisms: merging of compatible paths and frustrated competition among incompatible ones. Together they drive the network through nucleation, growth, and freezing into multi-input, single-output directed inverse trees. The picture reproduces the training dynamics of a 1.5-billion-parameter LLM and yields three predictions: reasoning chains lengthen as a geometric necessity of sparse topology; SFT induces catastrophic forgetting through bridge-node rupture; and frustration drives policy collapse. Building on the structural timing inherent in inverse-tree freezing, we propose Annealed-RLVR -- a brief SFT intervention at the moment of maximum frustration. It outperforms standard RLVR on both in- and out-of-distribution benchmarks, with the largest gains at high sampling budgets where standard RLVR collapses. The same SFT applied after the trees freeze instead triggers catastrophic forgetting, isolating timing as the active ingredient.
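
The phrase "high sampling budgets" refers to evaluation with many samples per problem, as in the best@k curves of Figure 6. The Python sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), as a stand-in; whether the paper computes best@k exactly this way is an assumption, and the per-problem counts are invented for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct answers among n drawn samples (requires k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts: both policies have the same mean accuracy (0.5).
spread = [(4, 2), (4, 2)]      # partially solves every problem
collapsed = [(4, 4), (4, 0)]   # masters one problem, never solves the other

for k in (1, 2, 4):
    print(k, mean_pass_at_k(spread, k), mean_pass_at_k(collapsed, k))
```

The two hypothetical policies tie at k = 1, but the collapsed one cannot improve with more samples because an unsolved problem contributes zero at every k. This is one way to see why per-problem accuracy distributions matter most at large k.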

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper claims that finite autoregressive capacity in LLMs compresses the prefix space into a Markovian Concept Network (CoNet) on which RLVR proceeds via path merging and frustrated competition, driving nucleation, growth, and freezing into multi-input single-output inverse trees. This framework is asserted to reproduce the training dynamics of a 1.5B-parameter LLM, explain phenomena such as lengthening reasoning chains and policy collapse, and motivate Annealed-RLVR (brief SFT at maximum frustration), which outperforms standard RLVR on in- and out-of-distribution benchmarks with largest gains at high sampling budgets.

Significance. If the central claims hold, the work supplies a statistical-physics account of slow-thinking emergence under RLVR, together with three falsifiable predictions and an empirically validated intervention (Annealed-RLVR) that improves performance where standard RLVR collapses. The combination of a structural explanation, reproduction of observed 1.5B-LLM curves, and a timing-specific SFT method would be a substantive contribution to the theory and practice of reasoning-oriented post-training.

major comments (4)
  1. [Abstract / CoNet introduction] Abstract and the section introducing the CoNet picture: the claim that finite autoregressive capacity 'necessarily' compresses the exponentially large prefix space into a Markov network of predictive states is asserted without derivation or explicit mapping from the autoregressive loss to the claimed network topology; it is therefore unclear whether this compression is required for the subsequent RLVR dynamics or inverse-tree freezing.
  2. [RLVR dynamics section] Abstract and the section on RLVR dynamics: no explicit mapping is supplied from the RLVR loss or policy gradient to the mechanisms of path merging and frustrated competition, nor is there a demonstration that observed length increases, forgetting, or collapse are geometric consequences of the inverse-tree topology rather than artifacts of the optimizer, reward sparsity, or sampling procedure.
  3. [Abstract] Abstract: the statement that the picture 'reproduces the training dynamics of a 1.5-billion-parameter LLM' is made without equations, error bars, exclusion criteria, or a description of how the free parameter (frustration timing threshold) was set, leaving open the possibility that the reproduction is post-hoc calibration rather than an independent test.
  4. [Annealed-RLVR section] The section describing Annealed-RLVR: the timing of the brief SFT intervention at 'maximum frustration' appears to be identified from the same training curves that the model is said to reproduce, creating a circularity concern for the claim that timing is the active ingredient.
minor comments (2)
  1. [Predictions section] Notation for 'inverse trees' and 'bridge-node rupture' should be defined more precisely with reference to the underlying graph structure before being used in the predictions.
  2. [Predictions] The manuscript would benefit from an explicit statement of the three predictions in a numbered list with corresponding experimental tests.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed report, which highlights both the potential significance of the work and areas where additional rigor would strengthen the presentation. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / CoNet introduction] Abstract and the section introducing the CoNet picture: the claim that finite autoregressive capacity 'necessarily' compresses the exponentially large prefix space into a Markov network of predictive states is asserted without derivation or explicit mapping from the autoregressive loss to the claimed network topology; it is therefore unclear whether this compression is required for the subsequent RLVR dynamics or inverse-tree freezing.

    Authors: We agree that an explicit derivation would improve clarity. In the revised manuscript we will add a new subsection that starts from the autoregressive cross-entropy objective and shows how finite capacity induces equivalence classes of prefixes that share identical conditional distributions; these classes form the nodes of the Markovian Concept Network. We will also state explicitly that this compression is a prerequisite for RLVR to operate on a tractable state space rather than the full prefix tree. revision: yes

  2. Referee: [RLVR dynamics section] Abstract and the section on RLVR dynamics: no explicit mapping is supplied from the RLVR loss or policy gradient to the mechanisms of path merging and frustrated competition, nor is there a demonstration that observed length increases, forgetting, or collapse are geometric consequences of the inverse-tree topology rather than artifacts of the optimizer, reward sparsity, or sampling procedure.

    Authors: We will insert a new paragraph that maps the policy-gradient update directly to the two mechanisms: compatible trajectories that share reward signals reinforce the same nodes (path merging), while incompatible trajectories compete for limited node capacity (frustrated competition). We will further show, via a simple topological argument on the emerging inverse-tree structure, that chain lengthening and eventual collapse follow from the single-output constraint independently of the specific optimizer or reward sparsity, and we will contrast this with control experiments that vary sampling temperature. revision: yes

  3. Referee: [Abstract] Abstract: the statement that the picture 'reproduces the training dynamics of a 1.5-billion-parameter LLM' is made without equations, error bars, exclusion criteria, or a description of how the free parameter (frustration timing threshold) was set, leaving open the possibility that the reproduction is post-hoc calibration rather than an independent test.

    Authors: We will expand the experimental section with the precise functional form used to generate the model curves, report error bars across three independent runs, state the exclusion criteria for outlier seeds, and document that the frustration threshold was fixed by the theoretical prediction of the nucleation-to-freezing transition rather than by fitting to the observed curves. A sensitivity plot will be added to demonstrate robustness. revision: yes

  4. Referee: [Annealed-RLVR section] The section describing Annealed-RLVR: the timing of the brief SFT intervention at 'maximum frustration' appears to be identified from the same training curves that the model is said to reproduce, creating a circularity concern for the claim that timing is the active ingredient.

    Authors: We maintain that the timing is not circular: the theory predicts a distinct peak in frustration immediately prior to inverse-tree freezing, and the empirical curves are used only to locate this theoretically predicted moment for the intervention. The claim that timing is the active ingredient is supported by the post-freeze SFT control, which produces forgetting, and by the fact that the performance gains are measured on held-out in- and out-of-distribution benchmarks. We will add an explicit paragraph separating the theoretical timing prediction from its empirical application. revision: partial

Circularity Check

0 steps flagged

No significant circularity in CoNet derivation or inverse-tree predictions

Full rationale

The paper derives the Concept Network (CoNet) as a direct consequence of finite autoregressive capacity compressing an exponentially large prefix space into a Markov network of predictive states, then analyzes RLVR dynamics via path merging and frustrated competition that nucleate, grow, and freeze into multi-input single-output inverse trees. This framework is used to interpret observed phenomena and to motivate the timing of the Annealed-RLVR SFT intervention at maximum frustration. The reproduction of 1.5B-LLM training dynamics serves as validation of the picture rather than a fitted input from which predictions are mechanically extracted; the three listed predictions (chain lengthening as geometric necessity, SFT-induced forgetting via bridge-node rupture, frustration-driven collapse) are presented as topological consequences independent of parameter tuning to the target curves. Benchmark outperformance at high sampling budgets supplies external falsifiability. No equation or step reduces by construction to its own inputs, and the central claim retains independent content beyond the data it reproduces.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that finite capacity forces Markov compression, the ad-hoc introduction of CoNet and inverse trees as explanatory entities, and at least one timing parameter that must be located from training curves.

free parameters (1)
  • frustration timing threshold
    The moment of maximum frustration at which brief SFT is applied is located by inspecting the same training dynamics the model is calibrated to reproduce.
axioms (1)
  • domain assumption Finite-capacity autoregressive models compress exponentially large prefix spaces into Markov networks of predictive states
    Invoked in the abstract to justify the CoNet picture and the subsequent random-walk description of slow thinking.
invented entities (2)
  • Concept Network (CoNet) no independent evidence
    purpose: Compressed Markov network on which slow thinking occurs as a random walk
    New conceptual object introduced to organize the path-merging and competition dynamics.
  • Inverse trees no independent evidence
    purpose: Multi-input single-output directed structures that emerge after freezing
    Postulated final topology that explains lengthening reasoning chains and policy behavior.

pith-pipeline@v0.9.0 · 5797 in / 1622 out tokens · 45893 ms · 2026-05-18T12:39:42.511117+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 11 internal anchors

  1. [1]

    P. W. Anderson. More Is Different . Science, 177 0 (4047): 0 393--396, August 1972. doi:10.1126/science.177.4047.393

  2. [2]

    Emergence of scaling in random networks

    Albert-L \'a szl \'o Barab \'a si and R \'e ka Albert. Emergence of scaling in random networks. science, 286 0 (5439): 0 509--512, 1999

  3. [3]

    verl: Volcano engine reinforcement learning for llms

    ByteDance Seed Team and verl community . verl: Volcano engine reinforcement learning for llms. https://github.com/volcengine/verl

  4. [4]

    Iteration head: A mechanistic study of chain-of-thought, 2024

    Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, and Julia Kempe. Iteration head: A mechanistic study of chain-of-thought, 2024. URL https://arxiv.org/abs/2406.02128

  5. [5]

    Learning-at-criticality in large language models for quantum field theory and beyond, 2025

    Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, and Kun Chen. Learning-at-criticality in large language models for quantum field theory and beyond, 2025. URL https://arxiv.org/abs/2506.03703

  6. [6]

    Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025. URL https://arxiv.org/abs/2508.10751

  7. [7]

    Reasoning with Exploration: An Entropy Perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective on reinforcement learning for llms, 2025. URL https://arxiv.org/abs/2506.14758

  8. [8]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

  9. [9]

    Improved supervised fine-tuning for large language models to mitigate catastrophic forgetting, 2025

    Fei Ding and Baiqiao Wang. Improved supervised fine-tuning for large language models to mitigate catastrophic forgetting, 2025. URL https://arxiv.org/abs/2506.09428

  10. [10]

    How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, 2024

    Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, 2024. URL https://arxiv.org/abs/2402.18312

  11. [11]

    Subgraph centrality in complex networks

    Ernesto Estrada and Naomichi Hatano. Subgraph centrality in complex networks. Physical Review E, 77 0 (3): 0 036111, 2008

  12. [12]

    Concise reasoning via reinforcement learning, 2025

    Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning, 2025. URL https://arxiv.org/abs/2504.05185

  13. [13]

    Mitigating forgetting in llm supervised fine-tuning and preference learning, 2025

    Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, and Tianyi Chen. Mitigating forgetting in llm supervised fine-tuning and preference learning, 2025. URL https://arxiv.org/abs/2410.15483

  14. [14]

    A complex network approach to topic models, 2023

    Javier Ferrando, Mehran Rezagholizadeh, and VS Dinesh Chandra Prabhu. A complex network approach to topic models, 2023

  15. [15]

    Fisher and Michael N

    Michael E. Fisher and Michael N. Barber. Scaling Theory for Finite-Size Effects in the Critical Region . Physical Review Letters, 28 0 (23): 0 1516--1519, June 1972. doi:10.1103/PhysRevLett.28.1516

  16. [16]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  17. [17]

    Skywork open reasoner series

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog

  18. [18]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  19. [19]

    RL Fine-Tuning Heals OOD Forgetting in SFT

    Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. Rl fine-tuning heals ood forgetting in sft, 2025. URL https://arxiv.org/abs/2509.12235

  20. [20]

    Thinking, fast and slow

    Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011

  21. [21]

    Optimization by simulated annealing

    Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220 0 (4598): 0 671--680, 1983

  22. [22]

    Computerrl: Scaling end-to-end online reinforcement learning for computer use agents, 2025

    Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents, 2025. URL https://arxiv.org/abs/2508.14040

  23. [23]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

  24. [24]

    Revisiting catastrophic forgetting in large language model tuning

    Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 4297--4308, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.1865...

  25. [25]

    The emergence of essential structures in supervised learning

    Huan Li, Ziqiao Liu, Yiding Li, and Qing-Fu Zhang. The emergence of essential structures in supervised learning. Nature Communications, 14 0 (1): 0 6483, 2023

  26. [26]

    Learning without Forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282

  27. [27]

    Network dynamics-based framework for understanding deep neural networks, 2025

    Yuchen Lin, Yong Zhang, Sihan Feng, and Hong Zhao. Network dynamics-based framework for understanding deep neural networks, 2025. URL https://arxiv.org/abs/2501.02436

  28. [28]

    Understanding R1-Zero-Like Training : A Critical Perspective , March 2025

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like Training : A Critical Perspective , March 2025

  29. [29]

    Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...

  30. [30]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Kevin Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Peter Andersen, Ishan Misra, Shubham Singh, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023

  31. [31]

    Topology of reasoning: Understanding large reasoning models through reasoning graph properties, 2025

    Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Topology of reasoning: Understanding large reasoning models through reasoning graph properties, 2025

  32. [32]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177

  33. [33]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath : Pushing the Limits of Mathematical Reasoning in Open Language Models , April 2024. URL http://arxiv.org/abs/2402.03300. arXiv:2402.03300 [cs]

  34. [34]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

  35. [35]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

  36. [36]

    Reflexion: An autonomous agent with dynamic memory and self-reflection

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection, 2023

  37. [37]

    Introduction to phase transitions and critical phenomena

    Harry Eugene Stanley and Guenter Ahlers. Introduction to phase transitions and critical phenomena. American Journal of Physics, 40: 927--928, 1972. URL https://api.semanticscholar.org/CorpusID:10416417

  38. [38]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs, March 2025

  39. [39]

    A multiscale visualization of attention in the transformer model

    Jesse Vig. A multiscale visualization of attention in the transformer model, 2019

  40. [40]

    Understanding reasoning ability of language models from the perspective of reasoning paths aggregation

    Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, and William Yang Wang. Understanding reasoning ability of language models from the perspective of reasoning paths aggregation, 2024. URL https://arxiv.org/abs/2402.03268

  41. [41]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2022

  42. [42]

    Collective dynamics of ‘small-world’ networks

    Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684): 440--442, 1998

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824--24837, 2022

  44. [44]

    K. G. Wilson and John B. Kogut. The renormalization group and the epsilon expansion. Phys. Rept., 12: 75--199, 1974. doi:10.1016/0370-1573(74)90023-4

  45. [45]

    Kenneth G. Wilson. The renormalization group and critical phenomena. Reviews of Modern Physics, 55(3): 583--600, July 1983. doi:10.1103/RevModPhys.55.583

  46. [46]

    Towards system 2 reasoning in LLMs: Learning how to think with meta chain-of-thought

    Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, and Chelsea Finn. Towards system 2 reasoning in LLMs: Learning how to think with meta chain-of-thought, 2025. URL https://arxiv.org/abs/2501.04682

  47. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  48. [48]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

  49. [49]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale, March 2025. URL http://arxiv.org/abs/2503.14476. arXiv:2503.14476 [cs] version: 1

  50. [50]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837
