pith. sign in

arxiv: 2606.18089 · v1 · pith:TWDCVDPSnew · submitted 2026-06-16 · 💻 cs.LG

From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

Pith reviewed 2026-06-27 01:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords compositional generalizationlanguage model reasoningsupervised fine-tuningreinforcement learningreasoning tracesatomic moduleslatent selection
0
0 comments X

The pith

SFT supplies compound reasoning traces that embed atomic modules, while RL decomposes them to enable recombination on new problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models LLM reasoning as the output of a hierarchical latent selection process that composes discrete skills and routing choices into traces. It claims SFT populates the model with traces that contain these atomic modules as raw material, while RL performs the decomposition step that isolates the modules and permits their reuse in novel arrangements. Controlled experiments show that RL applied to SFT traces extracts the modules and recombines them successfully, and that training on full compound traces produces better generalization than training on isolated modules alone. The work identifies a practical division of labor in which SFT must cover every atomic module through varied compositions and RL is then used to explore combinations outside the original SFT distribution.

Core claim

Within the hierarchical latent selection model, reasoning traces are produced by a cascade of discrete latent variables that choose reusable atomic modules consisting of local skills and routing mechanisms; SFT supplies the raw material by generating compound traces that embed these modules, and RL decomposes the traces to recover the latent modules, thereby enabling compositional generalization to new configurations.

What carries the argument

The hierarchical latent selection model, in which discrete latent variables successively select skills and routing mechanisms to generate each reasoning trace.

If this is right

  • RL extracts atomic modules from the compound traces supplied by SFT and recombines them to solve new problem configurations.
  • Training on compound traces produces stronger generalization than training on isolated atomic modules.
  • An effective protocol has SFT ensure coverage of all atomic modules through compositional traces while RL targets novel compositions outside the SFT support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the decomposition view is correct, SFT data collection should prioritize breadth of module combinations rather than single-skill examples.
  • The same separation of roles may apply to other post-training stages such as preference tuning or synthetic data generation.
  • Failures of compositional generalization in current models could be diagnosed by checking whether RL has successfully isolated the latent routing mechanisms.

Load-bearing premise

Real LLM reasoning traces are produced by a cascade of discrete latent selection variables that pick reusable atomic modules in the exact way the model formalizes.

What would settle it

An experiment in which RL applied to SFT-generated compound traces produces no improvement on novel compositions compared with SFT alone, or in which performance remains unchanged when the assumed latent module structure is removed from the training objective.

Figures

Figures reproduced from arXiv: 2606.18089 by Eric P. Xing, Guangyi Chen, Kun Zhang, Lingjing Kong, Martin Q. Ma, Mikhail Yurochkin, Ruslan Salakhutdinov, Taylor W. Killian, Xiangchen Song, Xin Liu, Yuekai Sun, Zhengzhong Liu.

Figure 1
Figure 1. Figure 1: Illustration of atomic skills and routing mechanisms, their compositional traces, and out-of-distribution (OOD) composition evaluation. A reasoning trace is generated by composing two kinds of reusable atomic modules. (i) Atomic skills (colored squares; top-left inventory) are local operations on intermediate state, e.g., applying a rewrite rule, performing an arithmetic step, or executing a string transfo… view at source ↗
Figure 2
Figure 2. Figure 2: Hierarchical Latent Selection Model. A problem de￾scriptor P induces a hierarchy of discrete latent selection variables S which generate an observable reasoning trace D. In §3 we study when this latent structure can be identified from observed (P, D). Appendix K gives a concrete algebra example illustrating skills, routing mechanisms, and identification conditions. ultimately generate an observable reasoni… view at source ↗
Figure 2
Figure 2. Figure 2: Since S is unobserved, learning compositional generalization hinges on whether training can recover (up to an appropriate equivalence) a representation that behaves like the underlying latent modules. We therefore study la￾tent structure identification: when and how the hierarchy of selection variables and their dependency structure can be inferred from observed (P, D). §3 provides conditions un￾der which … view at source ↗
Figure 3
Figure 3. Figure 3: Accuracies of SFT and SFT+RL across composi￾tional depths. Models are trained on traces with Ltrain = 3. RL substantially improves test-time compositional generalization per￾formance over different compositional depths. stage, RL samples rollouts for the same task distribution and receives reward only from the final output string. To construct the chain-of-thought (CoT) training data for SFT, we generate t… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of model accuracy on seen and unseen compositions. RL provides limited gains on seen compositions while delivering substantially larger gains on unseen compositions. drops rapidly with depth while RL remains higher accuracy across depths (avg +40.3%); the SFT-vs-RL gap is much larger on unseen than seen compositions ( [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy on compositional traces involving held-out atomic skills across varying trace depths and re-injection strategies. The solid line represents the average result over two independent runs with different random seeds, while the shaded region shows the empirical variation across runs. “Iso.” denotes “Isolated”. All settings share the same SFT model trained on L = 3 traces. We remove compositions involv… view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy comparison across settings with full routing compositions, atomic routing mechanism removal, and routing mechanism re-injection with lower depth. We report the perfor￾mance on unseen compositions with both L = 3 and L = 4. SFT and RL data distributions (composition sets) influences atom discovery and reasoning. Specifically, we consider four canonical relationships between the two composition sets… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of models under different SFT-RL data distribution relationships. We evaluate the performance on seen compositions, unseen compositions, and depth extrapolation to higher compositional levels, respectively. The solid line reports the mean average over two seeds, and the shaded region shows variation across random seeds [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of SFT training on atomic skills (L = 1) versus compositional traces (L = 2&3) across composition depths. SFT trained on compositional traces exhibits stronger generalization performance. trace to a sequence of abstract mathematical skills. The RL￾tuned model shows richer diversity in short skill n-grams (n = 2, 3), indicating more flexible local recombination; longer n-grams are less diagnostic… view at source ↗
Figure 9
Figure 9. Figure 9: Toy binary-output diagnostic for trace-support recov￾ery. In this controlled diagnostic, RL rollout feedback reduces the distance to the known target distribution relative to the SFT-only model. E Toy Diagnostic for Trace-Support Recovery The SFT training data is described below. The input se￾quence X has a length of S, and each input character xi has a value range of {0, 1}. There are N input sequences, N… view at source ↗
Figure 10
Figure 10. Figure 10: Skill tokenization pipeline. This pipeline maps solu￾tion traces to sequences of atomic skill tokens and derives n-gram combinations. tribution than the SFT-trained model’s output distribution. F Full Experimental Setup This section gives the full experimental setup in [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Normalized n-gram frequency difference between the RL and SFT model. Positive values indicate higher relative frequency in the RL model (RL - SFT), whereas negative values indicate higher relative frequency in the SFT model [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Difference of numbers of unique grams between the RL and SFT model. We calculate the number of unique grams for both RL and SFT models, then visualize their differences here. recombination of reusable modules [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Recovery effect of re-injecting f⋆ (L = 1) in the SFT stage. We compare the SFT→RL baseline with variants that remove or re-inject f⋆. H Recovery Effect of Re-injection Depends on the Training Stage In [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Longer-run version of [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Longer-run version of [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: A concrete algebra example illustrating skills, routing mechanisms, and identification conditions ( [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: SAE-based causal graph on mathematical reasoning traces. Based on real-world math benchmarks, we apply pretrained SAEs to Gemma-2-2b-it, partition tokens by sequence position, and run causal discovery across SAE features. The graph contains skill-like nodes corresponding to mathematical operations and routing-like nodes corresponding to choices about which operation to apply given the current problem stat… view at source ↗
Figure 18
Figure 18. Figure 18: Full sensitivity experiments. Panel (a) follows the setting of [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
read the original abstract

Post-training pipelines that combine supervised fine-tuning (SFT) with reinforcement learning (RL) have emerged as the key recipe for transforming large language models (LLMs) into robust reasoners. We argue that this combined success is driven by compositional generalization, which we formalize through a hierarchical latent selection model. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). Within this model, we theoretically show that SFT and RL play asymmetric, complementary roles: SFT supplies the raw module materials in compositional traces, and RL decomposes those traces to identify the latent atomic modules and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces supplied by SFT and recombine them to solve new configurations. Moreover, we find that training on compound traces yields stronger generalization than training on isolated atomic modules. Finally, we investigate the relationship between SFT and RL data and identify an effective protocol in which SFT ensures coverage of all atomic modules through compositional traces, while RL focuses on novel compositions outside the SFT support to drive exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a hierarchical latent selection model in which LLM reasoning traces are generated by cascades of discrete latent selection variables corresponding to reusable atomic modules (skills and routing mechanisms). It derives an asymmetry within this model whereby SFT supplies compositional traces containing the raw module materials while RL decomposes the traces to identify the latent modules and enable compositional generalization. Controlled experiments are presented to show that RL can extract and recombine modules from SFT-supplied compound traces, that training on compound traces outperforms training on isolated modules, and to recommend a protocol in which SFT ensures module coverage and RL targets novel compositions.

Significance. If the hierarchical latent selection model accurately describes the generative process underlying real LLM reasoning traces, the work supplies a theoretical account of why combined SFT+RL pipelines succeed at compositional generalization and yields a concrete protocol for allocating SFT and RL data. The controlled experiments demonstrate internal consistency of the model and the practical value of compound traces. Credit is due for the explicit formalization and the falsifiable protocol recommendation.

major comments (2)
  1. [Experiments] Experiments section: the controlled experiments generate traces from the hierarchical latent selection model itself rather than from actual LLM forward passes. This validates internal consistency of the derived asymmetry but leaves untested the load-bearing assumption that real reasoning traces are produced by discrete latent cascades of atomic modules and routing variables; without this mapping the protocol recommendation cannot be applied to LLM post-training.
  2. [Theoretical Analysis] Model definition and theoretical analysis: the asymmetric roles of SFT and RL follow directly from the model construction (SFT supplies traces, RL performs decomposition). The paper should provide either (a) an empirical test on real LLM traces or (b) an explicit discussion of how the central claim would be falsified if the discrete latent structure is absent in practice.
minor comments (2)
  1. [Abstract] The abstract and introduction should state more explicitly that the hierarchical model is an explanatory device rather than a claim derived from observed LLM data.
  2. [Model Definition] Notation for the latent selection variables and the decomposition objective could be introduced with a single running example to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify that our controlled experiments are synthetic (generated from the model) and that the asymmetry follows from the model definition. We agree these points warrant additional discussion of assumptions and falsifiability. Below we respond point-by-point and indicate planned revisions.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the controlled experiments generate traces from the hierarchical latent selection model itself rather than from actual LLM forward passes. This validates internal consistency of the derived asymmetry but leaves untested the load-bearing assumption that real reasoning traces are produced by discrete latent cascades of atomic modules and routing variables; without this mapping the protocol recommendation cannot be applied to LLM post-training.

    Authors: We agree that the experiments use traces sampled from the hierarchical latent selection model rather than real LLM forward passes. This choice was deliberate to isolate the mechanism and demonstrate internal consistency of the SFT/RL asymmetry under the model's assumptions. We acknowledge this leaves the mapping to real LLM traces as an untested hypothesis. In revision we will add a new subsection (likely in the Discussion) that (i) explicitly states the assumption, (ii) outlines how the protocol would be applied if the discrete latent structure holds in practice, and (iii) suggests concrete empirical probes (e.g., activation patching or module extraction on real traces) that could test the mapping. This keeps the paper's scope focused on theory while addressing applicability concerns. revision: partial

  2. Referee: [Theoretical Analysis] Model definition and theoretical analysis: the asymmetric roles of SFT and RL follow directly from the model construction (SFT supplies traces, RL performs decomposition). The paper should provide either (a) an empirical test on real LLM traces or (b) an explicit discussion of how the central claim would be falsified if the discrete latent structure is absent in practice.

    Authors: The referee is correct that the asymmetry is a direct consequence of the model construction. We do not claim to have performed (a) an empirical test on real LLM traces; the work is positioned as a formalization supported by controlled experiments. For (b), we will add an explicit falsifiability paragraph: if real reasoning traces lack discrete latent cascades (e.g., if composition occurs via continuous or highly entangled representations), then (i) the predicted SFT/RL division of labor would not hold, (ii) training on compound traces would not outperform isolated modules, and (iii) the recommended protocol would need revision. We will also note observable signatures (modular reuse patterns, routing variables) that could be measured to falsify the model. revision: yes

Circularity Check

0 steps flagged

No significant circularity; model serves as explanatory formalization with separate controlled validation

full rationale

The paper introduces a hierarchical latent selection model as a framework to formalize how reasoning traces arise from cascades of discrete latent variables for atomic modules and routing. Within this model it derives the asymmetric roles of SFT (supplying traces) versus RL (decomposing to modules). Controlled experiments are then described to validate the theory by demonstrating extraction and recombination. No equations, self-citations, or fitted quantities are quoted that reduce the central claim to a definitional identity or to a prediction forced by the same fit. The derivation chain therefore remains self-contained against external benchmarks rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on a newly introduced domain model whose core assumptions are not independently verified outside the paper; no numerical free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption Reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules including skills and routing mechanisms.
    This modeling premise is invoked to formalize compositional generalization and to derive the asymmetric roles of SFT and RL.
invented entities (2)
  • hierarchical latent selection model no independent evidence
    purpose: Formal framework that represents reasoning traces as selections over atomic modules
    Newly proposed structure that carries the entire theoretical argument; no independent evidence supplied in abstract.
  • atomic modules (skills and routing mechanisms) no independent evidence
    purpose: Reusable discrete components whose identification enables compositional generalization
    Postulated entities inside the latent selection model; no external falsifiable signature given.

pith-pipeline@v0.9.1-grok · 5804 in / 1348 out tokens · 39859 ms · 2026-06-27T01:10:46.172294+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

192 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    arXiv preprint arXiv:2512.07783 , year=

    On the interplay of pre-training, mid-training, and rl on reasoning language models , author=. arXiv preprint arXiv:2512.07783 , year=

  2. [2]

    arXiv preprint arXiv:2502.03373 , year=

    Demystifying long chain-of-thought reasoning in llms , author=. arXiv preprint arXiv:2502.03373 , year=

  3. [3]

    arXiv preprint arXiv:2506.10947 , year=

    Spurious rewards: Rethinking training signals in rlvr , author=. arXiv preprint arXiv:2506.10947 , year=

  4. [4]

    arXiv preprint arXiv:2509.21016 , year=

    RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs? , author=. arXiv preprint arXiv:2509.21016 , year=

  5. [5]

    ArXiv , year=

    OpenAI o1 System Card , author=. ArXiv , year=

  6. [6]

    arXiv preprint arXiv:2501.12948 , year=

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  7. [7]

    ArXiv , year=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. ArXiv , year=

  8. [8]

    Shunyu Yao , title =

  9. [9]

    2025 , url =

    Luke Zettlemoyer , title =. 2025 , url =

  10. [10]

    ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

    Mingjie Liu and Shizhe Diao and Ximing Lu and Jian Hu and Xin Dong and Yejin Choi and Jan Kautz and Yi Dong , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.24864 , eprinttype =. 2505.24864 , timestamp =

  11. [11]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui and Yuchen Zhang and Jiacheng Chen and Lifan Yuan and Zhi Wang and Yuxin Zuo and Haozhan Li and Yuchen Fan and Huayu Chen and Weize Chen and Zhiyuan Liu and Hao Peng and Lei Bai and Wanli Ouyang and Yu Cheng and Bowen Zhou and Ning Ding , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2505.22617 , eprinttype =. 2505.22617 , timestamp =

  12. [12]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Yu Yue and Tiantian Fan and Gaohong Liu and Lingjun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and Jiangjie Chen and Chengyi Wang and Hongli ...

  13. [13]

    Lillicrap and Karen Simonyan and Demis Hassabis , title =

    David Silver and Thomas Hubert and Julian Schrittwieser and Ioannis Antonoglou and Matthew Lai and Arthur Guez and Marc Lanctot and Laurent Sifre and Dharshan Kumaran and Thore Graepel and Timothy P. Lillicrap and Karen Simonyan and Demis Hassabis , title =. CoRR , volume =. 2017 , url =. 1712.01815 , timestamp =

  14. [14]

    David Silver and Aja Huang and Chris J. Maddison and Arthur Guez and Laurent Sifre and George van den Driessche and Julian Schrittwieser and Ioannis Antonoglou and Vedavyas Panneershelvam and Marc Lanctot and Sander Dieleman and Dominik Grewe and John Nham and Nal Kalchbrenner and Ilya Sutskever and Timothy P. Lillicrap and Madeleine Leach and Koray Kavuk...

  15. [15]

    CoRR , volume =

    OpenAI and Ilge Akkaya and Marcin Andrychowicz and Maciek Chociej and Mateusz Litwin and Bob McGrew and Arthur Petron and Alex Paino and Matthias Plappert and Glenn Powell and Raphael Ribas and Jonas Schneider and Nikolas Tezak and Jerry Tworek and Peter Welinder and Lilian Weng and Qiming Yuan and Wojciech Zaremba and Lei Zhang , title =. CoRR , volume =...

  16. [16]

    Psychological Review , volume=

    Acquisition of Cognitive Skill , author=. Psychological Review , volume=

  17. [17]

    Jiayi Pan and Junjie Zhang and Xingyao Wang and Lifan Yuan and Hao Peng and Alane Suhr , title =

  18. [18]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng and Yuzhen Huang and Qian Liu and Wei Liu and Keqing He and Zejun Ma and Junxian He , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.18892 , eprinttype =. 2503.18892 , timestamp =

  19. [19]

    Kakade and Cengiz Pehlevan and Samy Jelassi and Eran Malach , title =

    Rosie Zhao and Alexandru Meterez and Sham M. Kakade and Cengiz Pehlevan and Samy Jelassi and Eran Malach , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.07912 , eprinttype =. 2504.07912 , timestamp =

  20. [20]

    COLM , year =

    Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin , title =. COLM , year =

  21. [21]

    Cognition , year=

    Connectionism and cognitive architecture: A critical analysis , author=. Cognition , year=

  22. [22]

    Lake and Tomer D

    Brenden M. Lake and Tomer D. Ullman and Joshua B. Tenenbaum and Samuel J. Gershman , title =. CoRR , volume =

  23. [23]

    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    Neural Module Networks , author=. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

  24. [24]

    ArXiv , year=

    Learning Composable Chains-of-Thought , author=. ArXiv , year=

  25. [25]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie and Zitian Gao and Qingnan Ren and Haoming Luo and Yuqian Hong and Bryan Dai and Joey Zhou and Kai Qiu and Zhirong Wu and Chong Luo , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2502.14768 , eprinttype =. 2502.14768 , timestamp =

  26. [26]

    Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P

    Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric ...

  27. [27]

    The Invisible Leash: Why

    Fang Wu and Weihao Xuan and Ximing Lu and Za. The Invisible Leash: Why. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.14843 , eprinttype =. 2507.14843 , timestamp =

  28. [28]

    ArXiv , year=

    The Llama 3 Herd of Models , author=. ArXiv , year=

  29. [29]

    ArXiv , year=

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. ArXiv , year=

  30. [30]

    ArXiv , year=

    Stream of Search (SoS): Learning to Search in Language , author=. ArXiv , year=

  31. [31]

    Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Christopher Wilhelm and Luca Soldaini and Noah A

    Nathan Lambert and Jacob Daniel Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James Validad Miranda and Alisa Liu and Nouha Dziri and Xinxi Lyu and Yuling Gu and Saumya Malik and Victoria Graf and Jena D. Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Christopher Wilhelm and Luca Soldai...

  32. [32]

    ArXiv , year=

    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment , author=. ArXiv , year=

  33. [33]

    ArXiv , year=

    Spurious Rewards: Rethinking Training Signals in RLVR , author=. ArXiv , year=

  34. [34]

    ArXiv , year=

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning , author=. ArXiv , year=

  35. [35]

    2025 , url=

    Mirage or Method? How Model-Task Alignment Induces Divergent RL Conclusions , author=. 2025 , url=

  36. [36]

    Nature , author =

    Oriol Vinyals and Igor Babuschkin and Wojciech M. Czarnecki and Micha. Grandmaster level in StarCraft. Nat. , volume =. 2019 , url =. doi:10.1038/S41586-019-1724-Z , timestamp =

  37. [37]

    The Knowledge Engineering Review , volume=

    Pre-training with non-expert human demonstration for deep reinforcement learning , author=. The Knowledge Engineering Review , volume=. 2019 , publisher=

  38. [38]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Encoding human domain knowledge to warm start reinforcement learning , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  39. [39]

    arXiv preprint arXiv:2506.02355 , year=

    Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening , author=. arXiv preprint arXiv:2506.02355 , year=

  40. [40]

    arXiv preprint arXiv:2506.14245 , year=

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs , author=. arXiv preprint arXiv:2506.14245 , year=

  41. [41]

    arXiv preprint arXiv:2506.01347 , year=

    The surprising effectiveness of negative reinforcement in LLM reasoning , author=. arXiv preprint arXiv:2506.01347 , year=

  42. [42]

    arXiv preprint arXiv:2507.10532 , year=

    Reasoning or memorization? unreliable results of reinforcement learning due to data contamination , author=. arXiv preprint arXiv:2507.10532 , year=

  43. [43]

    arXiv preprint arXiv:2504.20571 , year=

    Reinforcement learning for reasoning in large language models with one training example , author=. arXiv preprint arXiv:2504.20571 , year=

  44. [44]

    arXiv preprint arXiv:2502.01456 , year=

    Process reinforcement through implicit rewards , author=. arXiv preprint arXiv:2502.01456 , year=

  45. [45]

    arXiv preprint arXiv:2505.11770 , year=

    Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors , author=. arXiv preprint arXiv:2505.11770 , year=

  46. [46]

    arXiv preprint arXiv:2505.13742 , year=

    Understanding Task Representations in Neural Networks via Bayesian Ablation , author=. arXiv preprint arXiv:2505.13742 , year=

  47. [47]

    arXiv preprint arXiv:2505.14352 , year=

    Towards eliciting latent knowledge from LLMs with mechanistic interpretability , author=. arXiv preprint arXiv:2505.14352 , year=

  48. [48]

    2025 , eprint=

    UFT: Unifying Supervised and Reinforcement Fine-Tuning , author=. 2025 , eprint=

  49. [49]

    arXiv preprint arXiv:2505.13417 , year=

    AdaptThink: Reasoning Models Can Learn When to Think , author=. arXiv preprint arXiv:2505.13417 , year=

  50. [50]

    arXiv preprint arXiv:2505.12629 , year=

    Enhancing Latent Computation in Transformers with Latent Tokens , author=. arXiv preprint arXiv:2505.12629 , year=

  51. [51]

    arXiv preprint arXiv:2505.13898 , year=

    Do Language Models Use Their Depth Efficiently? , author=. arXiv preprint arXiv:2505.13898 , year=

  52. [52]

    arXiv preprint arXiv:2505.13775 , year=

    Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens , author=. arXiv preprint arXiv:2505.13775 , year=

  53. [53]

    arXiv preprint arXiv:2403.09629 , year=

    Quiet-star: Language models can teach themselves to think before speaking , author=. arXiv preprint arXiv:2403.09629 , year=

  54. [54]

    2022 , url=

    Show Your Work: Scratchpads for Intermediate Computation with Language Models , author=. 2022 , url=

  55. [55]

    arXiv preprint arXiv:2405.08644 , year=

    Thinking tokens for language modeling , author=. arXiv preprint arXiv:2405.08644 , year=

  56. [56]

    R0-FoMo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models , year=

    Think before you speak: Training Language Models With Pause Tokens , author=. R0-FoMo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models , year=

  57. [57]

    Advances in Neural Information Processing Systems , volume=

    Training chain-of-thought via latent-variable inference , author=. Advances in Neural Information Processing Systems , volume=

  58. [58]

    arXiv preprint arXiv:2503.18866 , year=

    Reasoning to learn from latent thoughts , author=. arXiv preprint arXiv:2503.18866 , year=

  59. [59]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  60. [60]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Adareasoner: Adaptive reasoning enables more flexible thinking , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  61. [61]

    arXiv preprint arXiv:2505.11756 , year=

    Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders , author=. arXiv preprint arXiv:2505.11756 , year=

  62. [62]

    2025 , eprint=

    Reinforcing General Reasoning without Verifiers , author=. 2025 , eprint=

  63. [63]

    arXiv preprint arXiv:2503.19618 , year=

    Learning to chain-of-thought with Jensen's evidence lower bound , author=. arXiv preprint arXiv:2503.19618 , year=

  64. [64]

    arXiv preprint arXiv:2502.09297 , year=

    When do neural networks learn world models? , author=. arXiv preprint arXiv:2502.09297 , year=

  65. [65]

    arXiv preprint arXiv:2505.15134 , year=

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning , author=. arXiv preprint arXiv:2505.15134 , year=

  66. [66]

    2025 , eprint=

    Maximizing Confidence Alone Improves Reasoning , author=. 2025 , eprint=

  67. [67]

    2025 , eprint=

    Test-Time Learning for Large Language Models , author=. 2025 , eprint=

  68. [68]

    arXiv preprint arXiv:2504.09597 , year=

    Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws , author=. arXiv preprint arXiv:2504.09597 , year=

  69. [69]

    arXiv preprint arXiv:2505.18404 , year=

    Thought calibration: Efficient and confident test-time scaling , author=. arXiv preprint arXiv:2505.18404 , year=

  70. [70]

    arXiv preprint arXiv:2505.18373 , year=

    Next-token pretraining implies in-context learning , author=. arXiv preprint arXiv:2505.18373 , year=

  71. [71]

    arXiv preprint arXiv:2505.13379 , year=

    Thinkless: Llm learns when to think , author=. arXiv preprint arXiv:2505.13379 , year=

  72. [72]

    arXiv preprint arXiv:2505.21444 , year=

    Can Large Reasoning Models Self-Train? , author=. arXiv preprint arXiv:2505.21444 , year=

  73. [73]

    arXiv preprint arXiv:2505.17697 , year=

    Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models , author=. arXiv preprint arXiv:2505.17697 , year=

  74. [74]

    arXiv preprint arXiv:2505.16400 , year=

    AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning , author=. arXiv preprint arXiv:2505.16400 , year=

  75. [75]

    2025 , eprint=

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning , author=. 2025 , eprint=

  76. [76]

    2025 , eprint=

    Incentivizing LLMs to Self-Verify Their Answers , author=. 2025 , eprint=

  77. [77]

    arXiv preprint arXiv:2504.13837 , year=

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=

  78. [78]

    arXiv preprint arXiv:2505.12392 , year=

    SLOT: Sample-specific Language Model Optimization at Test-time , author=. arXiv preprint arXiv:2505.12392 , year=

  79. [79]

    arXiv preprint arXiv:2505.11711 , year=

    Reinforcement Learning Finetunes Small Subnetworks in Large Language Models , author=. arXiv preprint arXiv:2505.11711 , year=

  80. [80]

    arXiv preprint arXiv:2502.12118 , year=

    Scaling test-time compute without verification or rl is suboptimal , author=. arXiv preprint arXiv:2502.12118 , year=

Showing first 80 references.