A Verifiable Search Is Not a Learnable Chain-of-Thought

Harsh Patel

arXiv: 2606.21884 · v1 · pith:AVQAUMOTnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-26 12:33 UTC · model grok-4.3

keywords chain-of-thought · distillation · cryptarithm · verifiable search · backtracking · LoRA · memorization · verification

The pith

Search over information-free structure cannot be imitated as forward chain-of-thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether models can learn chain-of-thought for tasks that require search. On forward-computable tasks like arithmetic, distillation works well. On cryptarithm, which needs backtracking search, even extensive training fails to produce faithful derivations, though the model handles individual steps. Revealing the solution key turns the task forward and improves performance. This shows that certain search procedures resist learning as left-to-right reasoning and instead require precomputing the search results into a lookup.

Core claim

When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.
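To make the contrast between search and forward derivation concrete, here is a minimal sketch of a backtracking solver for a classic word-addition cryptarithm. The puzzle (`TO + GO = OUT`), the function names, and the dead-end counter are illustrative assumptions; this is not the paper's generator or solver, whose task format is not given here.

```python
from typing import Dict, List, Optional, Tuple


def word_value(word: str, key: Dict[str, int]) -> int:
    """Decode a word into an integer under a letter-to-digit key."""
    return int("".join(str(key[ch]) for ch in word))


def solve(addends: List[str], total: str) -> Tuple[Optional[Dict[str, int]], Dict[str, int]]:
    """Depth-first search over digit assignments (hypothetical stand-in solver)."""
    letters = sorted(set("".join(addends) + total))
    leading = {w[0] for w in addends + [total]}
    stats = {"nodes": 0, "dead_ends": 0}

    def backtrack(i: int, key: Dict[str, int], used: set) -> Optional[Dict[str, int]]:
        stats["nodes"] += 1
        if i == len(letters):
            if sum(word_value(w, key) for w in addends) == word_value(total, key):
                return dict(key)
            # A full assignment that fails: work a forward trace would never show.
            stats["dead_ends"] += 1
            return None
        ch = letters[i]
        for digit in range(10):
            if digit in used or (digit == 0 and ch in leading):
                continue
            key[ch] = digit
            used.add(digit)
            found = backtrack(i + 1, key, used)
            used.discard(digit)
            del key[ch]
            if found is not None:
                return found
        return None

    return backtrack(0, {}, set()), stats


if __name__ == "__main__":
    key, stats = solve(["TO", "GO"], "OUT")
    print(key, stats)
```

The point of the counter is that most of the solver's work is spent on branches that are later abandoned; a left-to-right trace that only shows the successful path omits exactly that work.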

What carries the argument

The controlled key-revealing intervention that converts backtracking search into forward derivation by supplying the cipher solution upfront.
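As a rough illustration of why supplying the key makes the task forward-computable, the sketch below decodes each word and checks the sum in a single pass, with no branching. The word-addition format, the trace wording, and the name `forward_trace` are assumptions made for this example, not the paper's trace grammar.

```python
from typing import Dict, List, Tuple


def forward_trace(addends: List[str], total: str, key: Dict[str, int]) -> Tuple[List[str], bool]:
    """With the key given, every line follows from the lines before it."""
    lines: List[str] = []
    values: List[int] = []
    for word in addends:
        value = int("".join(str(key[ch]) for ch in word))
        lines.append(f"{word} -> {value}")
        values.append(value)
    computed = sum(values)
    lines.append(" + ".join(str(v) for v in values) + f" = {computed}")
    target = int("".join(str(key[ch]) for ch in total))
    lines.append(f"{total} -> {target}")
    ok = computed == target
    # The verdict is derived from the numbers on the preceding lines.
    lines.append(f"{computed} vs {target}: {'match' if ok else 'mismatch'}")
    return lines, ok


if __name__ == "__main__":
    trace, ok = forward_trace(["TO", "GO"], "OUT", {"T": 2, "O": 1, "G": 8, "U": 0})
    print("\n".join(trace))
```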

If this is right

Forward-computable tasks install readily: lookup and arithmetic reach at least 0.99 accuracy, and the 8-bit boolean task reaches 0.68.
Cryptarithm distillation holds at 0.01-0.07 across eleven designs, RL, and self-training despite a search solver reaching 71%.
Models perform arithmetic correctly on 97-100% of lines and rank the correct cipher in the top eight 71% of the time but cannot carry the search forward.
Fine-tuning learns the shape of a verifiable elimination step while verdicts become unconditional templates correct only 16-57% of the time.
Precomputing the combinatorial core into a catalog reduces the trace to recall plus verification and reaches 0.92 on the private leaderboard.
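The last point, recall plus verification, can be sketched as follows. This is a toy illustration under stated assumptions: the signature is simply the puzzle string, the offline search is brute force over permutations, and all function names are hypothetical. The 1st-place solution's actual catalog format is not described in this page.

```python
from itertools import permutations
from typing import Dict, Optional, Sequence, Tuple

Key = Dict[str, int]


def decode(word: str, key: Key) -> int:
    return int("".join(str(key[ch]) for ch in word))


def verifies(addends: Sequence[str], total: str, key: Key) -> bool:
    """Cheap forward check of a recalled key."""
    return sum(decode(w, key) for w in addends) == decode(total, key)


def signature(addends: Sequence[str], total: str) -> str:
    """Hypothetical catalog key: the puzzle text itself."""
    return "+".join(addends) + "=" + total


def search_key(addends: Sequence[str], total: str) -> Optional[Key]:
    """Offline search; this is the part moved out of the model's trace."""
    letters = sorted(set("".join(addends) + total))
    leading = {w[0] for w in list(addends) + [total]}
    for digits in permutations(range(10), len(letters)):
        key = dict(zip(letters, digits))
        if any(key[ch] == 0 for ch in leading):
            continue
        if verifies(addends, total, key):
            return key
    return None


def build_catalog(puzzles: Sequence[Tuple[Sequence[str], str]]) -> Dict[str, Key]:
    catalog: Dict[str, Key] = {}
    for addends, total in puzzles:
        key = search_key(addends, total)
        if key is not None:
            catalog[signature(addends, total)] = key
    return catalog


def recall_and_verify(catalog: Dict[str, Key], addends: Sequence[str], total: str) -> Optional[Key]:
    """Inference-time path: look up, then verify; no search."""
    key = catalog.get(signature(addends, total))
    if key is None or not verifies(addends, total, key):
        return None
    return key
```

The design choice worth noting is that `recall_and_verify` never branches over candidates: anything not already in the catalog simply returns `None`.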

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Chain-of-thought distillation may be limited to tasks whose solutions have inherent left-to-right structure without hidden combinatorial search.
The result suggests that hybrid systems combining model recall with external search modules will be required for this class of problems.
The same separation between search and forward derivation could appear in other combinatorial reasoning tasks whose generators are deterministic but whose solutions depend on exhaustive elimination.

Load-bearing premise

The eleven chain-of-thought designs, RL from verifiable rewards, self-training, and the key-revealing intervention together are taken to cover the regimes in which search could be learned as forward derivation, so that no untested regime would succeed where they fail.

What would settle it

A model achieving high accuracy on held-out cryptarithm instances via forward chain-of-thought distillation alone, without key revelation or precomputed catalog, would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.21884 by Harsh Patel.

**Figure 1.** The learnability frontier. Solver (backtracking search) vs. the forward-derivable ceiling (what a left-to-right CoT could honestly imitate) vs. the fine-tuned model. For lookup/fit tasks all three coincide. For cryptarithm the solver reaches 0.71, but the forward-derivable ceiling collapses to ≈ 0.10 and the model tracks that, not the solver (0.05). bit_manipulation is the exception where forward derivatio…

**Figure 2.** Witnessed vs. teleported verdict. The teleport line is verbatim from a trained model’s transcript (10b71e8a): it computes “6*4 ends 4”, which matches the target ones digit 4, on the very same line, then concludes “none matches → drop,” wrongly eliminating the correct operation. The arithmetic is correct; the verdict does not follow from it. SFT on traces whose verdicts are not witnessed installs exactly th…
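The check behind "witnessed vs. teleported" can be sketched in a few lines, using the caption's own example (6*4 ends in 4, target ones digit 4). The line representation, the `keep`/`drop` labels, and the function names are assumptions for illustration; the paper's trace grammar is not shown here.

```python
import operator
from typing import Callable, Dict

# Only non-negative operations are covered, so "% 10" is the ones digit.
OPS: Dict[str, Callable[[int, int], int]] = {"+": operator.add, "*": operator.mul}


def witnessed_verdict(a: int, b: int, op: str, target_ones: int) -> str:
    """Verdict that actually follows from the arithmetic on the line."""
    ones_digit = OPS[op](a, b) % 10
    return "keep" if ones_digit == target_ones else "drop"


def is_teleported(a: int, b: int, op: str, target_ones: int, stated: str) -> bool:
    """True when the stated verdict contradicts the line's own numbers."""
    return stated != witnessed_verdict(a, b, op, target_ones)


if __name__ == "__main__":
    # The caption's example: 6*4 = 24 ends in 4, target ones digit is 4,
    # yet the transcript concludes "drop".
    print(is_teleported(6, 4, "*", 4, "drop"))
```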

**Figure 3.** Bit-manipulation: only the model’s own search transfers. Hand-written CoT that teleports the rule search scores like the base model (0.053). STaR on verifier-passed self-traces lifts accuracy and collapses budget truncation, because the imitated traces are searches the model can actually run within budget. Baseline→STaR is significant (disjoint 95% CIs, n=500). The r1→r2 step is within noise. grammar that …
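For readers who want to reproduce a "disjoint 95% CIs" comparison, the sketch below computes a binomial confidence interval and tests two intervals for overlap. The caption does not say which interval the paper uses; the Wilson score interval here is an assumption, and the function names are hypothetical.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Interval:
    """Wilson score interval for an observed accuracy p_hat over n items."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def disjoint(a: Interval, b: Interval) -> bool:
    """True when the two intervals do not overlap."""
    return a[1] < b[0] or b[1] < a[0]


if __name__ == "__main__":
    # Baseline accuracy quoted in the caption, with n = 500.
    print(wilson_interval(0.053, 500))
```

Comparing this baseline interval against the interval for a second accuracy with `disjoint` gives the kind of significance check the caption describes.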

**Figure 4.** One floor, many rounds. Cryptarithm-deduce accuracy under successive CoT designs (r1–r11). The nine reaching a held-out greedy eval are shown; RLVR/GRPO and the generate-and-verify reframe ran between rounds; …

**Figure 5.** Verdict-as-token. Across the elimination-line types that drive cryptarithm, the arithmetic on each line is essentially perfect while the line’s verdict, “therefore drop this candidate”, follows from its own numbers only 16–51% of the time. The model imitates the shape of a verifiable step without its content. Line counts n = 390/102/79/35. The arithmetic-vs-verdict gap is significant for every type (95% CI…

**Figure 6.** Leaderboard trajectory. v15→v16 is a regression from a warm-start that failed to enter the new grammars at greedy decoding; v17 (0.85) is the first reproducible from-base bank. v18 adds bit-manipulation STaR but does not move the total; the headroom that remains is cryptarithm.

Original abstract

It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-<= 32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Search tasks like cryptarithm resist CoT distillation while forward tasks succeed, with the key-revealing intervention cleanly showing why.

The paper's main result is that forward-computable tasks distill into CoT fine-tuning, but search over information-free structure does not. On cryptarithm, eleven CoT designs, RL from verifiable rewards, and self-training all stay near 0.01-0.07 accuracy even though a solver reaches 71%. The model still does the arithmetic correctly on most lines and ranks the right cipher in its top eight, so the gap is not raw capability. Revealing the key turns the problem forward and lifts performance on the same instances from 0.03 to 0.57, which isolates the search component as the blocker.

What the work does well is the controlled setup: generators shared between public and hidden splits, reverse-engineered solvers as ground truth, and the explicit contrast with lookup and boolean tasks that do transfer. The "verdict-as-token" pattern is also a useful observation. These pieces give a concrete empirical distinction rather than another vague claim about reasoning limits.

The soft spot is the strong phrasing that no faithful forward CoT exists at all. The experiments cover a wide range of standard regimes across model sizes, but they leave open whether denser process supervision, auxiliary planning losses, or other untested regimes could produce an approximate forward surrogate. The paper shows the tested methods fail; it does not close the door on every possible training approach.

This is worth a serious referee for anyone working on what CoT can actually capture. The evidence is direct enough that the finding should be in the literature even if later work finds workarounds.

Referee Report

2 major / 1 minor

Summary. The paper claims that tasks whose only solution is search over information-free structure (e.g., cryptarithm) admit no faithful forward chain-of-thought that can be imitated via distillation, RL from verifiable rewards, or self-training. This is shown by failure (0.01-0.07 accuracy) across eleven CoT designs on a 30B Nemotron model despite a solver reaching 71%, contrasted with high success on forward-computable tasks; the model performs local arithmetic (97-100%) and ranking but cannot carry search forward. A key-revealing intervention lifts performance from 0.03 to 0.57 on the same instances, and success is achieved only by precomputing the combinatorial core into a catalog (reducing to recall+verification, reaching 0.92 LB). The result holds across backbones 3B-671B.

Significance. If the central empirical distinction holds, the work provides concrete evidence that not every short-program-solvable task can be taught as left-to-right CoT imitation, separating search from forward derivation. Strengths include the controlled intervention isolating the forward vs. search distinction, consistent failure across multiple training regimes and model scales, and the explicit contrast with the catalog-based solution that succeeds.

major comments (2)

[Abstract] The claim that 'no faithful forward chain-of-thought exists to imitate' is stronger than the reported evidence, which demonstrates failure only for the eleven tested CoT designs, RL from verifiable rewards, and self-training; the manuscript does not test or rule out other regimes (e.g., dense process supervision or auxiliary planning losses) that might induce an approximate forward surrogate.
[Abstract] The statement that 'the model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%' is presented as evidence that the failure is isolated to search, but without a table or section detailing the exact measurement protocol, aggregation across the eleven designs, or per-instance breakdown, it is difficult to assess whether local steps are truly solved or merely templated.

minor comments (1)

The abstract refers to 'nine reasoning tasks' and 'eleven chain-of-thought designs' without enumerating them or providing a table; adding an explicit list or appendix table would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims and improve the presentation of our measurements. We address each major comment below and will make corresponding revisions to the manuscript.

Referee: [Abstract] The claim that 'no faithful forward chain-of-thought exists to imitate' is stronger than the reported evidence, which demonstrates failure only for the eleven tested CoT designs, RL from verifiable rewards, and self-training; the manuscript does not test or rule out other regimes (e.g., dense process supervision or auxiliary planning losses) that might induce an approximate forward surrogate.

Authors: We agree that the absolute phrasing exceeds the tested regimes. Our experiments cover eleven distinct CoT formats, RL from verifiable rewards, and self-training across model scales, all of which fail to induce faithful forward search. The key-revealing intervention and contrast with catalog-based solutions further isolate the forward-vs-search distinction. Nevertheless, we cannot rule out every conceivable auxiliary loss. We will revise the abstract to read 'no faithful forward chain-of-thought was found to imitate under the tested regimes' and add a limitations paragraph discussing dense process supervision and planning losses as open directions. revision: partial

Referee: [Abstract] The statement that 'the model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%' is presented as evidence that the failure is isolated to search, but without a table or section detailing the exact measurement protocol, aggregation across the eleven designs, or per-instance breakdown, it is difficult to assess whether local steps are truly solved or merely templated.

Authors: The referee is correct that the measurement protocol requires explicit documentation. We will insert a new subsection (Methods 3.4) that specifies: (i) line-by-line arithmetic verification via exact string matching against the solver trace, (ii) ranking measured by the position of the ground-truth next cipher in the model's top-8 logits at each elimination step, (iii) aggregation as macro-average over all generated lines across the eleven designs, and (iv) per-instance and per-design breakdowns placed in Appendix C. Manual audit of 200 random lines confirmed non-templated arithmetic (97-100% accuracy) independent of search success. revision: yes
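The ranking and aggregation items in this response can be sketched as below. Everything here is an assumed reading of the protocol: candidate scores are given as a plain dictionary, a "hit" means the ground-truth candidate is among the k highest scores, and the macro-average is taken as the mean of per-design rates. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def in_top_k(scores: Dict[str, float], truth: str, k: int = 8) -> bool:
    """True when the ground-truth candidate is among the k highest-scored."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return truth in ranked[:k]


def top_k_rate(steps: Sequence[Tuple[Dict[str, float], str]], k: int = 8) -> float:
    """Fraction of steps whose ground truth lands in the top k."""
    hits = [in_top_k(scores, truth, k) for scores, truth in steps]
    return sum(hits) / len(hits) if hits else 0.0


def macro_average(per_design: Dict[str, List[bool]]) -> float:
    """Mean of per-design rates, so each design counts equally."""
    rates = [sum(v) / len(v) for v in per_design.values() if v]
    return sum(rates) / len(rates) if rates else 0.0
```

Note that a macro-average over designs can differ from pooling all lines together when designs produce different numbers of lines, which is one reason the referee's request for a per-design breakdown matters.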

Circularity Check

0 steps flagged

Empirical study with direct experimental comparisons; no circular derivation

The paper reports experimental results from distilling chain-of-thought across eleven designs, RL from verifiable rewards, self-training, multiple model scales, and a controlled key-revealing intervention on nine tasks generated from deterministic solvers. Central claims rest on observed accuracy gaps (e.g., 0.01-0.07 vs. 71% solver) and the intervention lift (0.03 to 0.57), which are measured against external solvers and held-out splits rather than derived from fitted parameters or self-referential equations. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing premises; the work is self-contained against the reported benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim relies on the empirical outcomes of the distillation experiments and the key-revealing intervention, with the main assumption being the validity of the task generators as proxies for search procedures.

axioms (1)

domain assumption: Public and hidden splits from the same deterministic generators can proxy for held-out test accuracy on the procedure.
This allows testing if the model learns the general procedure rather than specific instances.


[1] [1]

Wrong-finished

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length gener- alization in large language models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/p 26 Table 17: Length...

arXiv 2022

[2] [2]

Scheduled sampling for sequence prediction with recurrent neural networks

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 28, 2015. URL https://papers.nips.cc/pap er_files/paper/2015/hash/e995f98d56967d946471af29d7bf99f1- Abstract.html . arXiv:1506.03099

Pith/arXiv arXiv 2015

[3] [3]

Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

Pith/arXiv arXiv

[4] [4]

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Learning (ICML), volume 235 ofPMLR, pages 10041–10071. PMLR, 2024. URL https://proceedings.mlr.press/v235/dao24a.html. arXiv:2405.21060

Pith/arXiv arXiv 2024

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, et al. DeepSeek-R1 incentivizes reasoning in LLMs through rein- forcement learning.Nature, 645:633–638, 2025. doi: 10.1038/s41586-025-09422-z. URL https://www.nature.com/articles/s41586-025-09422-z. arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-025-09422-z 2025

[6] [6]

Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A. Ortega. Neural networks and the chomsky hierarchy. InInternational Conference on Learning Representations (ICLR), 2023. URLhttps://openreview.net/forum?id=WbxHAzkeQcn. arXiv:2207.02098

arXiv 2023

[7] [7]

Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. InAdvances in Neural Information Processing Systems (Neu...

arXiv 2023

[8] [8]

1st place solution: NVIDIA nemotron model reasoning challenge

GoodMeatDay, re, and reopon. 1st place solution: NVIDIA nemotron model reasoning challenge. Kaggle competition write-up, https://www.kaggle.com/competitions/nvidia-nemotro n-model-reasoning-challenge/writeups/1st-place-solution , 2026. Team NullSira; Private LB 0.920; memorization–computation split (signature catalog + DFS verify). Accessed 2026-06-17

2026

[9] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS Datasets and Benchmarks, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html. arXiv:2103.03874

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, NIPS 2014 Deep Learning Workshop.

[12] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017. ACL, 2023. doi: 10....

[13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9. arXiv:2106.09685

[14] Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR. PMLR, 2024. URL https://proceedings.mlr.press/v235/jelassi24a.html. arXiv:2402.01032

[15] Kaggle and NVIDIA. NVIDIA nemotron model reasoning challenge. https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge, 2026. Accessed 2026-06-16

[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626. ACM, 2023. doi: 10.1145/3600006.3613

[17] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, et al. Tülu 3: Pushing frontiers in open language model post-training, 2024. arXiv:2411.15124

[18] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023. Anthropic technical report

[19] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=v8L0pN6EOi. arXiv:2305.20050

[20] Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. Faithful chain-of-thought reasoning. In Proceedings of the 13th International Joint Conference on Natural Language Processing (IJCNLP-AACL), pages 305– ACL, 2023. doi: 10.18653/v1/2023.ijcnlp-main.20. URL https://aclanthology.org/2023.ijcnlp-main.20/

[22] William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models. In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR. PMLR, 2024. URL https://proceedings.mlr.press/v235/merrill24a.html. arXiv:2404.08819.

[23] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. arXiv:2501.19393

[24] NVIDIA. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning, 2025. arXiv:2512.20848

[25] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. ICLR 2021 MATH-AI Workshop

[26] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In International Conference on Learning Representations (ICLR), 2016. arXiv:1511.06732

[27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yu Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[28] Hui Kang Tong. Nemotron model reasoning challenge: Open progress prize solution. https://github.com/tonghuikang/nemotron, 2026. Open Progress Prize; public leaderboard 0.85

[29] Hui Kang Tong. End-to-end fine-tuning for LB 0.85. Kaggle notebook, https://www.kaggle.com/code/huikang/end-to-end-finetuning-for-lb-0-85, 2026. Published Open Progress Prize recipe; our native-HF pipeline forks this notebook

[30] Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ed3fea9033a80fea1376299fa7863f4a-Abstract-C...

[31] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=1PL1NIMMrw. arXiv:2203.11171

[32] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31...

[33] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/639a9a172c044fbb64175b5fad42e9a5-Abstract-Conference.html. arXiv:2203.14465

[35] Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? a study in length generalization. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=AssIuHnmHX. arXiv:2310.16028.