arXiv: 2509.23629 · v3 · submitted 2025-09-28 · cs.AI · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.soc-ph

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, Kun Chen

Pith reviewed 2026-05-18 12:39 UTC · model grok-4.3

classification cs.AI · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · physics.soc-ph

keywords reinforcement learning · verifiable rewards · large language models · slow thinking · concept network · inverse trees · annealed RLVR · training dynamics


The pith

RLVR causes LLMs to develop slow thinking by freezing concept networks into inverse trees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper offers a statistical-physics account of how reinforcement learning with verifiable rewards lets large language models learn multi-step reasoning from only final-answer feedback. It argues that any autoregressive model must compress its vast prefix space into a Markov network of predictive states, called the Concept Network. On this network, RLVR operates through two processes: merging paths that can be combined and creating competition between those that cannot. These drive the network to nucleate, grow, and freeze into directed inverse trees with multiple inputs leading to single outputs. This view matches the training curves of a 1.5-billion-parameter model and leads to a new training method, Annealed-RLVR, that inserts a short supervised fine-tuning step at the height of frustration and beats standard RLVR, particularly when many answers are sampled.

Core claim

The central discovery is that slow thinking in LLMs emerges as the Concept Network freezes into multi-input, single-output directed inverse trees under RLVR dynamics. Path merging of compatible reasoning steps and frustrated competition among incompatible ones govern the evolution through nucleation, growth, and freezing stages. This structural picture reproduces the observed training dynamics of a 1.5B parameter LLM and generates specific predictions about lengthening reasoning chains, the timing dependence of supervised fine-tuning effects, and policy collapse under high frustration. The resulting Annealed-RLVR procedure, which applies brief SFT exactly at maximum frustration, delivers in-

What carries the argument

The Concept Network (CoNet), the Markov network of predictive states into which the autoregressive model compresses its prefix space, on which slow thinking acts as a random walk that RLVR then organizes into inverse trees via merging and competition.
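To make the random-walk picture concrete, here is a minimal sketch. It is an illustration only, not the paper's model: the random directed graph with fixed out-degree, the uniform transition rule, and the names `build_network` and `random_walk` are all assumptions introduced here.

```python
import random

def build_network(n_nodes, out_degree, seed=0):
    """Hypothetical network: every node gets `out_degree` distinct successors."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    return {u: rng.sample([v for v in nodes if v != u], out_degree) for u in nodes}

def random_walk(graph, start, answers, max_steps=200, seed=0):
    """Walk uniformly at random from `start` until an answer node is hit
    or `max_steps` transitions have been taken."""
    rng = random.Random(seed)
    path = [start]
    while path[-1] not in answers and len(path) <= max_steps:
        path.append(rng.choice(graph[path[-1]]))
    return path

graph = build_network(n_nodes=30, out_degree=3)   # node 0 = question, node 29 = answer
path = random_walk(graph, start=0, answers={29})
print(len(path) - 1, "steps; reached answer:", path[-1] == 29)
```

In this reading, each visited node stands for one predictive state and the path length plays the role of the reasoning-chain length. Training would reshape the transition probabilities, which this sketch leaves uniform.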

If this is right

Reasoning chains lengthen as a geometric necessity of the sparse topology of the frozen inverse trees.
Applying supervised fine-tuning after the trees have frozen causes catastrophic forgetting through rupture of bridge nodes.
High frustration during training drives the policy to collapse.
Annealed-RLVR improves benchmark scores over standard RLVR on both in-distribution and out-of-distribution tasks, with gains largest at high sampling budgets where standard methods fail.
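The first consequence above ties chain length to sparsity. A small sketch shows the geometric intuition under stated assumptions: the circulant graph, the node count, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import deque

def circulant(n_nodes, out_degree):
    """Directed graph where node i links to i+1, ..., i+out_degree (mod n_nodes)."""
    return {i: [(i + d) % n_nodes for d in range(1, out_degree + 1)]
            for i in range(n_nodes)}

def shortest_hops(graph, source, target):
    """Breadth-first search; returns the minimum number of edges, or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

n = 64
for k in (16, 4, 2):
    print(f"out-degree {k:2d}: {shortest_hops(circulant(n, k), 0, n - 1)} hops")
```

With 64 nodes, the shortest route from node 0 to node 63 takes 4, 16, and 32 hops for out-degrees 16, 4, and 2. Fewer available transitions force longer paths even when the endpoints are unchanged.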

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The timing of interventions relative to the freezing transition may generalize to other training regimes that aim to preserve or enhance reasoning structures.
If the inverse-tree topology holds, then methods to directly encourage or measure such structures could accelerate the emergence of reliable multi-step reasoning.
Connections to physical systems with similar nucleation and freezing dynamics might offer new ways to analyze or control LLM training trajectories.

Load-bearing premise

An autoregressive model's finite capacity must compress its exponentially large prefix space into a Markov network of predictive states on which reasoning unfolds as a random walk.

What would settle it

A direct test would be to check whether the internal concept activations or state transitions during RLVR training exhibit the predicted sequence of nucleation, growth, and freezing into inverse-tree structures; absence of such patterns or failure of Annealed-RLVR to outperform at high sampling budgets would challenge the account.
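If state transitions could be extracted as a directed graph, one way to operationalize the structural test is a check for the multi-input, single-output shape. The sketch below assumes an "inverse tree" means an in-tree (every node has exactly one outgoing edge except a single sink, with no cycles). That reading, and the name `is_inverse_tree`, are assumptions made here rather than the paper's definition.

```python
def is_inverse_tree(edges, nodes):
    """True if `edges` form an in-tree over `nodes`:
    one sink, out-degree 1 everywhere else, and no cycles."""
    succ = {}
    for u, v in edges:
        if u in succ:            # a second outgoing edge breaks single-output
            return False
        succ[u] = v
    sinks = [n for n in nodes if n not in succ]
    if len(sinks) != 1:
        return False
    for start in nodes:          # follow each node's unique outgoing chain
        seen, u = set(), start
        while u in succ:
            if u in seen:        # revisiting a node means a cycle
                return False
            seen.add(u)
            u = succ[u]
    return True

nodes = ["q1", "q2", "a", "b", "ans"]
merged = [("q1", "a"), ("q2", "a"), ("a", "b"), ("b", "ans")]
print(is_inverse_tree(merged, nodes))                               # two questions merge into one chain
print(is_inverse_tree(merged + [("a", "c")], nodes + ["c"]))        # node "a" branches
print(is_inverse_tree([("a", "b"), ("b", "a"), ("c", "ans")],
                      ["a", "b", "c", "ans"]))                      # contains a cycle
```

The function expects `nodes` to list every endpoint that appears in `edges`.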

Figures

Figures reproduced from arXiv: 2509.23629 by Kun Chen, Linfeng Zhang, Pan Zhang, Sihan Hu, Xiansheng Cai, Youjin Deng, Yuan Huang, Zhiyuan Yao.

**Figure 1.** CoNet Reproduces Core LLM Training Dynamics. The minimal CoNet model (a) reproduces the two core empirical signatures of RLVR training observed in the DeepScaleR-1.5B LLM (b). These signatures are: (i) a two-stage reward dynamic consisting of a fast-learning stage followed by a slow-learning phase (top panels), and (ii) a non-monotonic, V-shaped evolution of the correct response length (bottom panels). Thi…

**Figure 2.** Visualizing the Structural Evolution from “Skill Islands” to the Concept Web. These network snapshots provide a direct, qualitative narrative of the concept web’s formation, corresponding to the different phases of training. In each subfigure, green, red, and blue dots indicate the question, answer, and intermediate nodes of the CoNet, respectively. The color of each directed edge represents the transitio…

**Figure 3.** A Sparse-Web Structure Necessitates Longer Reasoning Chains. This figure links the emergent sparse topology of the concept web to the observed increase in response length. (b, c) The color and marker scheme is identical to that used in …

**Figure 4.** The Fragility of a Sparse Web: Catastrophic Forgetting and Fast Recovery. This figure demonstrates a key prediction of our sparse-web hypothesis: that the concept web is fragile, relying on critical bridge-like connections. (c-e) The color and shape conventions follow those of …

**Figure 5.** Microscopic Mechanisms of RLVR: Forgetting by Frustration, Learning by Phase Transition. The learning trajectories for individual problems in CoNet (a) and the 1.5B LLM (b) reveal a fundamental duality. (i) Frustration-Induced Forgetting: At the onset of slow learning [see inset in (a)], the intense competition for connections on a sparse web manifests as volatile, non-monotonic accuracy curves, where some …

**Figure 6.** Annealed-RLVR Outperforms Standard RLVR. This figure compares the best@k accuracy curves of our Annealed-RLVR ("Annealed") with the standard RLVR baseline ("RLVR"). The evaluation is performed on (a) an in-distribution set (Randomly Selected 512 Training Problems) and two out-of-distribution (OOD) sets: (b) the Minerva dataset and (c) the AIME 2024/2025 datasets. The results show that Annealed-RLVR consis…

**Figure 7.** Illustration of the LLM reasoning process. A chain-of-thought is composed of stable, low-entropy …

**Figure 8.** From Intractable LLM Dynamics to a Minimal CoNet Model. The reasoning process in an LLM (left) can be viewed as a traversal on a high-dimensional latent graph, where path probabilities are determined by the complex, autoregressive Transformer. Directly analyzing this graph’s evolution is intractable. CoNet (right) provides a minimal abstraction by replacing the intractable latent graph with a fixed K-regul…

**Figure 9.** The Formation of the Concept Web in CoNet. This figure illustrates the evolution of network clusters during training, showing a structural reorganization from isolated skills to a unified conceptual web. In the initial phase, the Cluster Number (orange curve) spikes, indicating the rapid discovery of numerous disconnected “skill islands”. The peak marks a critical structural juncture. Following this peak, …

**Figure 10.** Per-Problem Critical Dynamics During the Slow-Learning Stage. These plots reveal the microscopic learning dynamics for a representative subset of problems that successfully learn during the slow-learning stage, providing evidence for localized phase transitions. (Left) Accuracy trajectories for these problems. All problems shown exhibit a sharp, sigmoidal-like increase in accuracy rate in the training pro…

**Figure 11.** Structural Impact of Annealed-RLVR on the Concept Web. Evolution of concept web for standard RLVR (blue) versus Annealed-RLVR (orange). The SFT intervention at step 50 (dashed line) induces an immediate drop in the Annealed model’s cluster size. Subsequently, the Annealed model recovers and surpasses the standard RLVR baseline, which exhibits slower growth, ultimately forming a larger final concept web. T…

**Figure 12.** Validation of Annealed-RLVR in CoNet. (Left) The histogram of per-problem accuracy shows that the annealed model (SFT-CKPT750) significantly reduces the number of completely unsolved problems (Accuracy = 0.0) and increases the number of mastered problems (Accuracy = 1.0) compared to the standard model (CKPT800). (Right) The line plot of training dynamics shows that immediately after the SFT intervention …

**Figure 13.** Mechanism of Annealed-RLVR: Accuracy Distributions. This figure complements the best@k curves in the main text by revealing the underlying mechanism of performance improvement. It shows the per-problem accuracy histograms for the checkpoints of Annealed-RLVR ("Annealed") versus the standard RLVR ("RLVR") baseline, evaluated on (a) the in-distribution set (Random 512), (b) the OOD Minerva dataset, and (c)…

Original abstract

Reinforcement learning with verifiable rewards (RLVR) enables large language models to acquire slow, multi-step reasoning from sparse final-answer signals. We provide a statistical-physics picture of this emergence. We show that an autoregressive model's finite capacity forces it to compress its exponentially large prefix space into a Markov network of predictive states, on which slow thinking unfolds as a random walk -- the Concept Network (CoNet) picture. Within CoNet, RLVR dynamics are governed by two mechanisms: merging of compatible paths and frustrated competition among incompatible ones. Together they drive the network through nucleation, growth, and freezing into multi-input, single-output directed inverse trees. The picture reproduces the training dynamics of a 1.5-billion-parameter LLM and yields three predictions: reasoning chains lengthen as a geometric necessity of sparse topology; SFT induces catastrophic forgetting through bridge-node rupture; and frustration drives policy collapse. Building on the structural timing inherent in inverse-tree freezing, we propose Annealed-RLVR -- a brief SFT intervention at the moment of maximum frustration. It outperforms standard RLVR on both in- and out-of-distribution benchmarks, with the largest gains at high sampling budgets where standard RLVR collapses. The same SFT applied after the trees freeze instead triggers catastrophic forgetting, isolating timing as the active ingredient.
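The abstract's reference to "high sampling budgets" concerns metrics evaluated over k samples per problem. As a hedged illustration, the sketch below uses the standard unbiased pass@k estimator. Whether the paper's best@k metric is computed exactly this way is not stated in the text here, so treat the formula and the name `pass_at_k` as assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n attempts of which c are correct, is correct."""
    if k > n:
        raise ValueError("k cannot exceed n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts out of 64 attempts.
for correct in (0, 1, 32):
    scores = [round(pass_at_k(64, correct, k), 4) for k in (1, 8, 64)]
    print(f"correct={correct:2d} -> pass@1, pass@8, pass@64 = {scores}")
```

Under this estimator a problem solved once in 64 attempts counts fully at k = 64, while a problem that is never solved stays at zero for every k. Large-k scores are therefore most sensitive to how many problems a policy never solves at all.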

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a statistical-physics model in which an LLM's finite capacity compresses its prefix space into a Markov CoNet that RLVR then freezes into inverse trees, and it claims that a brief SFT at peak frustration beats standard RLVR. The compression step itself, however, is not derived from the training objective.


The main takeaway is that this work offers a geometric story for why RL with verifiable rewards produces longer reasoning chains and eventual collapse in LLMs. It frames the process as path merging and frustrated competition inside a compressed Concept Network that nucleates and freezes into multi-input, single-output inverse trees. On top of that, the authors propose Annealed-RLVR, a short SFT intervention timed to the moment of maximum frustration, and report that it improves both in- and out-of-distribution performance over plain RLVR, with the biggest lift when sampling budgets are high.

Referee Report

4 major / 2 minor

Summary. The paper claims that finite autoregressive capacity in LLMs compresses the prefix space into a Markovian Concept Network (CoNet) on which RLVR proceeds via path merging and frustrated competition, driving nucleation, growth, and freezing into multi-input single-output inverse trees. This framework is asserted to reproduce the training dynamics of a 1.5B-parameter LLM, explain phenomena such as lengthening reasoning chains and policy collapse, and motivate Annealed-RLVR (brief SFT at maximum frustration), which outperforms standard RLVR on in- and out-of-distribution benchmarks with largest gains at high sampling budgets.

Significance. If the central claims hold, the work supplies a statistical-physics account of slow-thinking emergence under RLVR, together with three falsifiable predictions and an empirically validated intervention (Annealed-RLVR) that improves performance where standard RLVR collapses. The combination of a structural explanation, reproduction of observed 1.5B-LLM curves, and a timing-specific SFT method would be a substantive contribution to the theory and practice of reasoning-oriented post-training.

major comments (4)

[Abstract / CoNet introduction] Abstract and the section introducing the CoNet picture: the claim that finite autoregressive capacity 'necessarily' compresses the exponentially large prefix space into a Markov network of predictive states is asserted without derivation or explicit mapping from the autoregressive loss to the claimed network topology; it is therefore unclear whether this compression is required for the subsequent RLVR dynamics or inverse-tree freezing.
[RLVR dynamics section] Abstract and the section on RLVR dynamics: no explicit mapping is supplied from the RLVR loss or policy gradient to the mechanisms of path merging and frustrated competition, nor is there a demonstration that observed length increases, forgetting, or collapse are geometric consequences of the inverse-tree topology rather than artifacts of the optimizer, reward sparsity, or sampling procedure.
[Abstract] Abstract: the statement that the picture 'reproduces the training dynamics of a 1.5-billion-parameter LLM' is made without equations, error bars, exclusion criteria, or a description of how the free parameter (frustration timing threshold) was set, leaving open the possibility that the reproduction is post-hoc calibration rather than an independent test.
[Annealed-RLVR section] The section describing Annealed-RLVR: the timing of the brief SFT intervention at 'maximum frustration' appears to be identified from the same training curves that the model is said to reproduce, creating a circularity concern for the claim that timing is the active ingredient.
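The second major comment asks how a policy gradient could give rise to frustrated competition. The toy calculation below is one possible illustration and not the paper's derivation. It assumes a single shared node with two outgoing edges, a softmax policy that depends only on the node and not on the question, and two questions that are rewarded for different edges. All names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward_grad(logits, rewarded_edge):
    """Exact gradient of P(rewarded_edge) with respect to the logits,
    i.e. the expected policy-gradient signal for a 0/1 reward."""
    p = softmax(logits)
    return [p[rewarded_edge] * ((1.0 if j == rewarded_edge else 0.0) - p[j])
            for j in range(len(logits))]

logits = [0.0, 0.0]                                        # one shared node, two edges
grad_a = expected_reward_grad(logits, rewarded_edge=0)     # question A needs edge 0
grad_b = expected_reward_grad(logits, rewarded_edge=1)     # question B needs edge 1
frustrated = [a + b for a, b in zip(grad_a, grad_b)]       # incompatible: signals cancel
merged = [a + a for a in grad_a]                           # compatible: signals add
print("A:", grad_a, "B:", grad_b)
print("incompatible sum:", frustrated, "compatible sum:", merged)
```

With two edges the opposing gradients cancel exactly, so the shared node cannot serve both questions at once. When both questions reward the same edge, the gradients add, which is one way to read path merging.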

minor comments (2)

[Predictions section] Notation for 'inverse trees' and 'bridge-node rupture' should be defined more precisely with reference to the underlying graph structure before being used in the predictions.
[Predictions] The manuscript would benefit from an explicit statement of the three predictions in a numbered list with corresponding experimental tests.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed report, which highlights both the potential significance of the work and areas where additional rigor would strengthen the presentation. We address each major comment below and outline the revisions we will make to the manuscript.


Referee: [Abstract / CoNet introduction] Abstract and the section introducing the CoNet picture: the claim that finite autoregressive capacity 'necessarily' compresses the exponentially large prefix space into a Markov network of predictive states is asserted without derivation or explicit mapping from the autoregressive loss to the claimed network topology; it is therefore unclear whether this compression is required for the subsequent RLVR dynamics or inverse-tree freezing.

Authors: We agree that an explicit derivation would improve clarity. In the revised manuscript we will add a new subsection that starts from the autoregressive cross-entropy objective and shows how finite capacity induces equivalence classes of prefixes that share identical conditional distributions; these classes form the nodes of the Markovian Concept Network. We will also state explicitly that this compression is a prerequisite for RLVR to operate on a tractable state space rather than the full prefix tree. revision: yes
Referee: [RLVR dynamics section] Abstract and the section on RLVR dynamics: no explicit mapping is supplied from the RLVR loss or policy gradient to the mechanisms of path merging and frustrated competition, nor is there a demonstration that observed length increases, forgetting, or collapse are geometric consequences of the inverse-tree topology rather than artifacts of the optimizer, reward sparsity, or sampling procedure.

Authors: We will insert a new paragraph that maps the policy-gradient update directly to the two mechanisms: compatible trajectories that share reward signals reinforce the same nodes (path merging), while incompatible trajectories compete for limited node capacity (frustrated competition). We will further show, via a simple topological argument on the emerging inverse-tree structure, that chain lengthening and eventual collapse follow from the single-output constraint independently of the specific optimizer or reward sparsity, and we will contrast this with control experiments that vary sampling temperature. revision: yes
Referee: [Abstract] Abstract: the statement that the picture 'reproduces the training dynamics of a 1.5-billion-parameter LLM' is made without equations, error bars, exclusion criteria, or a description of how the free parameter (frustration timing threshold) was set, leaving open the possibility that the reproduction is post-hoc calibration rather than an independent test.

Authors: We will expand the experimental section with the precise functional form used to generate the model curves, report error bars across three independent runs, state the exclusion criteria for outlier seeds, and document that the frustration threshold was fixed by the theoretical prediction of the nucleation-to-freezing transition rather than by fitting to the observed curves. A sensitivity plot will be added to demonstrate robustness. revision: yes
Referee: [Annealed-RLVR section] The section describing Annealed-RLVR: the timing of the brief SFT intervention at 'maximum frustration' appears to be identified from the same training curves that the model is said to reproduce, creating a circularity concern for the claim that timing is the active ingredient.

Authors: We maintain that the timing is not circular: the theory predicts a distinct peak in frustration immediately prior to inverse-tree freezing, and the empirical curves are used only to locate this theoretically predicted moment for the intervention. The claim that timing is the active ingredient is supported by the post-freeze SFT control, which produces forgetting, and by the fact that the performance gains are measured on held-out in- and out-of-distribution benchmarks. We will add an explicit paragraph separating the theoretical timing prediction from its empirical application. revision: partial
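The first response above describes prefixes grouped into equivalence classes that share identical conditional distributions. A minimal sketch of that grouping follows. It assumes the next-token distributions are available as a lookup table, and the table contents and the name `predictive_states` are hypothetical.

```python
from collections import defaultdict

def predictive_states(next_token_probs, decimals=6):
    """Group prefixes whose next-token distributions match after rounding."""
    classes = defaultdict(list)
    for prefix, dist in next_token_probs.items():
        key = tuple(sorted((tok, round(p, decimals)) for tok, p in dist.items()))
        classes[key].append(prefix)
    return sorted(sorted(group) for group in classes.values())

# Hypothetical table of next-token distributions for five prefixes.
table = {
    "2 + 2 =":     {"4": 1.0},
    "1 + 3 =":     {"4": 1.0},
    "so x":        {"=": 0.5, "+": 0.5},
    "therefore x": {"=": 0.5, "+": 0.5},
    "x =":         {"4": 0.9, "5": 0.1},
}
for state in predictive_states(table):
    print(state)
```

Here five prefixes collapse to three states. In a real model the match would be approximate rather than exact, which is why the sketch rounds probabilities before comparing.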

Circularity Check

0 steps flagged

No significant circularity in CoNet derivation or inverse-tree predictions


The paper derives the Concept Network (CoNet) as a direct consequence of finite autoregressive capacity compressing an exponentially large prefix space into a Markov network of predictive states, then analyzes RLVR dynamics via path merging and frustrated competition that nucleate, grow, and freeze into multi-input single-output inverse trees. This framework is used to interpret observed phenomena and to motivate the timing of the Annealed-RLVR SFT intervention at maximum frustration. The reproduction of 1.5B-LLM training dynamics serves as validation of the picture rather than a fitted input from which predictions are mechanically extracted; the three listed predictions (chain lengthening as geometric necessity, SFT-induced forgetting via bridge-node rupture, frustration-driven collapse) are presented as topological consequences independent of parameter tuning to the target curves. Benchmark outperformance at high sampling budgets supplies external falsifiability. No equation or step reduces by construction to its own inputs, and the central claim retains independent content beyond the data it reproduces.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that finite capacity forces Markov compression, the ad-hoc introduction of CoNet and inverse trees as explanatory entities, and at least one timing parameter that must be located from training curves.

free parameters (1)

frustration timing threshold
The moment of maximum frustration at which brief SFT is applied is located by inspecting the same training dynamics the model is calibrated to reproduce.

axioms (1)

domain assumption: Finite-capacity autoregressive models compress exponentially large prefix spaces into Markov networks of predictive states
Invoked in the abstract to justify the CoNet picture and the subsequent random-walk description of slow thinking.

invented entities (2)

Concept Network (CoNet) · no independent evidence
purpose: Compressed Markov network on which slow thinking occurs as a random walk
New conceptual object introduced to organize the path-merging and competition dynamics.
Inverse trees · no independent evidence
purpose: Multi-input single-output directed structures that emerge after freezing
Postulated final topology that explains lengthening reasoning chains and policy behavior.



Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 11 internal anchors

[1]

P. W. Anderson. More Is Different . Science, 177 0 (4047): 0 393--396, August 1972. doi:10.1126/science.177.4047.393

work page doi:10.1126/science.177.4047.393 1972
[2]

Emergence of scaling in random networks

Albert-L \'a szl \'o Barab \'a si and R \'e ka Albert. Emergence of scaling in random networks. science, 286 0 (5439): 0 509--512, 1999

work page 1999
[3]

verl: Volcano engine reinforcement learning for llms

ByteDance Seed Team and verl community . verl: Volcano engine reinforcement learning for llms. https://github.com/volcengine/verl

work page
[4]

Iteration head: A mechanistic study of chain-of-thought, 2024

Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, and Julia Kempe. Iteration head: A mechanistic study of chain-of-thought, 2024. URL https://arxiv.org/abs/2406.02128

work page arXiv 2024
[5]

Learning-at-criticality in large language models for quantum field theory and beyond, 2025

Xiansheng Cai, Sihan Hu, Tao Wang, Yuan Huang, Pan Zhang, Youjin Deng, and Kun Chen. Learning-at-criticality in large language models for quantum field theory and beyond, 2025. URL https://arxiv.org/abs/2506.03703

work page arXiv 2025
[6]

Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models, 2025. URL https://arxiv.org/abs/2508.10751

work page arXiv 2025
[7]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective on reinforcement learning for llms, 2025. URL https://arxiv.org/abs/2506.14758

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Improved supervised fine-tuning for large language models to mitigate catastrophic forgetting, 2025

Fei Ding and Baiqiao Wang. Improved supervised fine-tuning for large language models to mitigate catastrophic forgetting, 2025. URL https://arxiv.org/abs/2506.09428

work page arXiv 2025
[10]

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, 2024

Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, and Tanmoy Chakraborty. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning, 2024. URL https://arxiv.org/abs/2402.18312

work page arXiv 2024
[11]

Subgraph centrality in complex networks

Ernesto Estrada and Naomichi Hatano. Subgraph centrality in complex networks. Physical Review E, 77 0 (3): 0 036111, 2008

work page 2008
[12]

Concise reasoning via reinforcement learning, 2025

Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning, 2025. URL https://arxiv.org/abs/2504.05185

work page arXiv 2025
[13]

Mitigating forgetting in llm supervised fine-tuning and preference learning, 2025

Heshan Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, and Tianyi Chen. Mitigating forgetting in llm supervised fine-tuning and preference learning, 2025. URL https://arxiv.org/abs/2410.15483

work page arXiv 2025
[14]

A complex network approach to topic models, 2023

Javier Ferrando, Mehran Rezagholizadeh, and VS Dinesh Chandra Prabhu. A complex network approach to topic models, 2023

work page 2023
[15]

Fisher and Michael N

Michael E. Fisher and Michael N. Barber. Scaling Theory for Finite-Size Effects in the Critical Region . Physical Review Letters, 28 0 (23): 0 1516--1519, June 1972. doi:10.1103/PhysRevLett.28.1516

work page doi:10.1103/physrevlett.28.1516 1972
[16]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[17]

Skywork open reasoner series

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog

work page 2025
[18]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

work page 2021
[19]

RL Fine-Tuning Heals OOD Forgetting in SFT

Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. Rl fine-tuning heals ood forgetting in sft, 2025. URL https://arxiv.org/abs/2509.12235

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Thinking, fast and slow

Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011

work page 2011
[21]

Optimization by simulated annealing

Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. science, 220 0 (4598): 0 671--680, 1983

work page 1983
[22]

Computerrl: Scaling end-to-end online reinforcement learning for computer use agents, 2025

Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents, 2025. URL https://arxiv.org/abs/2508.14040

work page arXiv 2025
[23]

Solving quantitative reasoning problems with language models, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

work page 2022
[24]

Revisiting catastrophic forgetting in large language model tuning

Hongyu Li, Liang Ding, Meng Fang, and Dacheng Tao. Revisiting catastrophic forgetting in large language model tuning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 4297--4308, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.1865...

work page doi:10.18653/v1/2024.findings-emnlp.249 2024
[25]

The emergence of essential structures in supervised learning

Huan Li, Ziqiao Liu, Yiding Li, and Qing-Fu Zhang. The emergence of essential structures in supervised learning. Nature Communications, 14 0 (1): 0 6483, 2023

work page 2023
[26]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Network dynamics-based framework for understanding deep neural networks, 2025

Yuchen Lin, Yong Zhang, Sihan Feng, and Hong Zhao. Network dynamics-based framework for understanding deep neural networks, 2025. URL https://arxiv.org/abs/2501.02436

work page arXiv 2025
[28]

Understanding R1-Zero-Like Training : A Critical Perspective , March 2025

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like Training : A Critical Perspective , March 2025

work page 2025
[29]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...

work page 2025
[30]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Kevin Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Peter Andersen, Ishan Misra, Shubham Singh, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[31]

Topology of reasoning: Understanding large reasoning models through reasoning graph properties, 2025

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Topology of reasoning: Understanding large reasoning models through reasoning graph properties, 2025

work page 2025
[32]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177

[33]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024. URL http://arxiv.org/abs/2402.03300. arXiv:2402.03300 [cs]

[34]

RL's Razor: Why Online Reinforcement Learning Forgets Less

Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

[35]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

[36]

Reflexion: An autonomous agent with dynamic memory and self-reflection, 2023

Noah Shinn, Beck Labash, and Roger Grosse. Reflexion: An autonomous agent with dynamic memory and self-reflection, 2023

[37]

Introduction to phase transitions and critical phenomena

Harry Eugene Stanley and Guenter Ahlers. Introduction to phase transitions and critical phenomena. American Journal of Physics, 40:927--928, 1972. URL https://api.semanticscholar.org/CorpusID:10416417

[38]

Kimi k1.5: Scaling Reinforcement Learning with LLMs, March 2025

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs, March 2025

[39]

A multiscale visualization of attention in the transformer model, 2019

Jesse Vig. A multiscale visualization of attention in the transformer model, 2019

[40]

Understanding reasoning ability of language models from the perspective of reasoning paths aggregation, 2024

Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, and William Yang Wang. Understanding reasoning ability of language models from the perspective of reasoning paths aggregation, 2024. URL https://arxiv.org/abs/2402.03268

[41]

Self-consistency improves chain of thought reasoning in language models, 2022

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2022

[42]

Collective dynamics of ‘small-world’ networks

Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440--442, 1998

[43]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824--24837, 2022

[44]

K. G. Wilson and John B. Kogut. The renormalization group and the epsilon expansion. Phys. Rept., 12:75--199, 1974. doi:10.1016/0370-1573(74)90023-4

[45]

Kenneth G. Wilson. The renormalization group and critical phenomena. Reviews of Modern Physics, 55(3):583--600, July 1983. doi:10.1103/RevModPhys.55.583

[46]

Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025

Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, and Chelsea Finn. Towards system 2 reasoning in llms: Learning how to think with meta chain-of-thought, 2025. URL https://arxiv.org/abs/2501.04682

[47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[48]

Tree of thoughts: Deliberate problem solving with large language models, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

[49]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale, March 2025. URL http://arxiv.org/abs/2503.14476. arXiv:2503.14476 [cs] version: 1

[50]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837
