pith. machine review for the scientific record.

arxiv: 2604.01348 · v2 · submitted 2026-04-01 · 💻 cs.CL

Recognition: no theorem link

Procedural Knowledge at Scale Improves Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning memory · procedural knowledge · retrieval augmented generation · test-time scaling · subquestion decomposition · math benchmarks · coding tasks · language model reasoning

The pith

Decomposing reasoning trajectories into subquestion-subroutine pairs and retrieving them during inference boosts performance on math, science, and coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models can improve at hard reasoning by systematically reusing pieces of how-to knowledge extracted from earlier solutions rather than treating each new problem in isolation. Existing step-by-step trajectories are broken into compact, self-contained subquestion-subroutine pairs that form a large datastore. At test time a lightweight prompt lets the model surface the current subquestion, pull relevant subroutines, and continue reasoning with those pieces as guides. This Reasoning Memory approach outperforms retrieval of full documents, full trajectories, or templates as well as a compute-matched scaling baseline, reaching gains of up to 19.2 percent over no retrieval and 7.9 percent over the strongest baseline. The improvements trace to the breadth of procedural coverage and the decomposition-retrieval design that lets the model draw on diverse prior ways of reframing, approaching, and verifying problems.
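To make the pipeline concrete, here is a minimal sketch of the datastore-construction step, assuming a generic `llm` callable; the prompt wording, `ProceduralEntry`, and function names are illustrative, not the paper's artifacts.

```python
import json
from dataclasses import dataclass

@dataclass
class ProceduralEntry:
    subquestion: str  # what is being asked at this step
    subroutine: str   # the reusable procedure that answers it

# Hypothetical decomposition prompt; the paper's actual prompt is not given here.
DECOMPOSE_PROMPT = (
    "Split the reasoning trajectory below into self-contained steps. "
    "Return a JSON list of objects with keys 'subquestion' and 'subroutine'.\n\n"
    "Trajectory:\n{trajectory}"
)

def decompose_trajectory(llm, trajectory: str) -> list[ProceduralEntry]:
    """Ask an LLM to split one trajectory into subquestion-subroutine pairs."""
    raw = llm(DECOMPOSE_PROMPT.format(trajectory=trajectory))
    return [ProceduralEntry(**item) for item in json.loads(raw)]

def build_datastore(llm, trajectories: list[str]) -> list[ProceduralEntry]:
    """Aggregate pairs across all source trajectories (the paper reports ~32M entries)."""
    store: list[ProceduralEntry] = []
    for t in trajectories:
        store.extend(decompose_trajectory(llm, t))
    return store
```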

Core claim

Reasoning Memory starts from existing corpora of step-by-step reasoning trajectories, decomposes each trajectory into self-contained subquestion-subroutine pairs to create a datastore of 32 million compact procedural knowledge entries, and at inference time uses a lightweight in-thought prompt to let the model verbalize the core subquestion, retrieve relevant subroutines, and reason under them as implicit procedural priors. Across six math, science, and coding benchmarks this consistently outperforms RAG with document, trajectory, or template knowledge and a compute-matched test-time scaling baseline, with higher inference budgets yielding up to 19.2 percent improvement over no retrieval and 7.9 percent over the strongest compute-matched baseline.
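A companion sketch of the inference-time loop under the same caveats: `generate` and `retrieve_top_k` are hypothetical stand-ins for the model call and the datastore search, and the in-thought prompt wording is invented.

```python
# Illustrative in-thought prompt; the paper's actual wording is not reproduced here.
IN_THOUGHT_PROMPT = "State the core subquestion you are solving right now, in one line."

def reasoning_memory_step(generate, retrieve_top_k, trace: str, k: int = 3) -> str:
    # 1. Let the model verbalize its current subquestion within the trace.
    subquestion = generate(trace + "\n" + IN_THOUGHT_PROMPT).strip()
    # 2. Retrieve the k most relevant subroutines from the procedural datastore.
    subroutines = retrieve_top_k(subquestion, k)
    # 3. Continue reasoning with the retrieved procedures as implicit priors.
    hints = "\n".join(f"- {s}" for s in subroutines)
    return generate(trace + "\nRelevant procedures:\n" + hints + "\nContinue reasoning.")
```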

What carries the argument

Reasoning Memory: a retrieval-augmented generation framework that decomposes reasoning trajectories into subquestion-subroutine pairs to supply procedural knowledge at inference time.

If this is right

  • Models gain from reusing how-to steps extracted from past solutions instead of solving every problem from scratch.
  • Broad coverage of procedural patterns across large trajectory corpora drives larger gains than retrieval of facts or complete examples.
  • Higher inference budgets amplify the benefit of retrieving and conditioning on procedural subroutines.
  • The specific decomposition into subquestion-subroutine pairs enables cleaner extraction and safer reuse than whole-trajectory retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition approach could be applied to planning or multi-step tool-use tasks where procedural reuse matters more than factual lookup.
  • Models might close the loop by adding newly generated successful trajectories back into the datastore after each run; a minimal sketch of such a loop follows this list.
  • Scaling the size and diversity of the procedural datastore could offer a cheaper alternative to scaling model parameters for reasoning gains.
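As flagged in the second bullet, here is a minimal sketch of that hypothetical closed loop; `solve`, `is_correct`, and `decompose` are invented stand-ins, and nothing here is claimed by the paper.

```python
def self_augment(datastore, problems, solve, is_correct, decompose):
    """Fold verified successful traces back into the procedural datastore."""
    for problem in problems:
        trace = solve(problem)          # run Reasoning Memory inference
        if is_correct(problem, trace):  # keep only verified successes
            datastore.extend(decompose(trace))  # reuse the decomposition step
    return datastore
```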

Load-bearing premise

Decomposing existing reasoning trajectories into self-contained subquestion-subroutine pairs produces procedural knowledge that remains useful and non-misleading when retrieved and inserted into new reasoning traces.

What would settle it

A controlled test that force-inserts the retrieved subroutines into the model's trace: if accuracy on the same benchmarks then drops below the no-retrieval baseline, the load-bearing premise fails.
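A minimal sketch of that comparison, with `solve_with_memory`, `solve_plain`, and `is_correct` as hypothetical stand-ins for the two pipelines and the grader:

```python
def premise_holds(benchmark, solve_with_memory, solve_plain, is_correct) -> bool:
    """Premise survives only if force-inserted subroutines do not hurt accuracy."""
    with_mem = sum(is_correct(p, solve_with_memory(p)) for p in benchmark)
    plain = sum(is_correct(p, solve_plain(p)) for p in benchmark)
    return with_mem >= plain
```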

read the original abstract

Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reasoning Memory, a RAG framework that decomposes existing step-by-step reasoning trajectories into a datastore of 32 million self-contained subquestion-subroutine pairs. At inference time, a lightweight in-thought prompt enables the model to verbalize subquestions, retrieve relevant procedural knowledge, and reason under the retrieved subroutines as implicit priors. The authors report consistent outperformance over RAG variants using documents, full trajectories, or templates, as well as compute-matched test-time scaling baselines, with gains reaching 19.2% over no-retrieval and 7.9% over the strongest baseline across six math, science, and coding benchmarks. Ablations attribute the improvements to broad procedural coverage and the decomposition/retrieval design.

Significance. If the central empirical claims hold after addressing verification gaps, the work would establish that large-scale extraction and reuse of procedural knowledge from trajectories can deliver meaningful test-time gains in reasoning without retraining. The scale of the 32M-entry datastore and the consistent cross-domain improvements would position this as a practical complement to existing test-time scaling methods, potentially influencing how future systems store and retrieve reusable reasoning patterns.

major comments (2)
  1. [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.
  2. [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.
minor comments (2)
  1. [Abstract] The phrase 'lightweight in-thought prompt' is introduced without a brief definition or example; a short illustrative snippet would improve immediate clarity.
  2. [Method] The manuscript would benefit from an explicit statement of the total number of trajectories used to build the 32M-entry datastore and the decomposition algorithm's exact criteria for 'self-contained' pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve experimental rigor and provide additional verification of the mechanism.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.

    Authors: We agree that error bars, statistical significance tests, and fuller baseline details are needed to demonstrate robustness. In the revised version we will report standard deviations over multiple random seeds for all main results, include paired statistical tests (e.g., t-tests; a sketch of such a test follows these responses) against baselines, and expand the experimental setup and appendix with precise retrieval parameters (top-k, embedding model, similarity threshold), exact prompt lengths, and total token budgets for every baseline including the compute-matched test-time scaling condition. We will also add a controlled ablation that matches total context length across methods to rule out length-related confounds. revision: yes

  2. Referee: [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.

    Authors: We acknowledge the value of direct evidence that the model actually consults and benefits from the retrieved subroutines. We will add a new analysis subsection containing (1) manual inspection of 100 randomly sampled reasoning traces with counts of explicit references to retrieved subroutines, (2) a quantitative relevance metric (semantic similarity between verbalized subquestion and retrieved subroutine), and (3) an ablation measuring performance drop when retrieved items are replaced by random or irrelevant subroutines. These additions will help confirm that gains arise from procedural reuse rather than prompt diversity alone. Our existing decomposition ablations already isolate the design contribution, but the requested per-retrieval diagnostics will strengthen the mechanistic claim. revision: yes
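For concreteness, minimal sketches of two of the analyses promised above. First, the paired significance test from response 1, assuming accuracy is recorded per random seed (the seed protocol is an assumption, not the authors' stated setup):

```python
import numpy as np
from scipy import stats

def report_with_significance(method_acc_by_seed, baseline_acc_by_seed):
    """Mean and std over seeds, plus a paired t-test of method vs. baseline."""
    m = np.asarray(method_acc_by_seed, dtype=float)
    b = np.asarray(baseline_acc_by_seed, dtype=float)
    t, p = stats.ttest_rel(m, b)  # paired across matching seeds
    return {
        "method": (m.mean(), m.std(ddof=1)),
        "baseline": (b.mean(), b.std(ddof=1)),
        "t_statistic": float(t),
        "p_value": float(p),
    }
```

Second, the relevance metric from response 2, with `embed` as a placeholder for whatever sentence-embedding model the authors choose:

```python
import numpy as np

def retrieval_relevance(embed, subquestion: str, subroutines: list[str]) -> float:
    """Mean cosine similarity between the subquestion and retrieved subroutines."""
    q = np.asarray(embed(subquestion), dtype=float)
    sims = []
    for s in subroutines:
        v = np.asarray(embed(s), dtype=float)
        sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    return float(np.mean(sims))
```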

Circularity Check

0 steps flagged

No significant circularity in empirical RAG framework

full rationale

The paper describes an empirical method: decompose existing reasoning trajectories into subquestion-subroutine pairs to build a 32M-entry datastore, then retrieve and insert them via in-thought prompts at inference time. Performance claims (up to 19.2% over no-retrieval, 7.9% over compute-matched baselines) rest on benchmark evaluations and ablations attributing gains to procedural coverage plus decomposition/retrieval design. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The datastore is constructed from prior corpora and evaluated on separate benchmarks, keeping datastore construction independent of the evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that procedural knowledge in reasoning trajectories can be usefully decomposed into independent subquestion-subroutine pairs and that retrieval of those pairs provides non-harmful guidance during new inference. No free parameters are explicitly fitted in the abstract description, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Reasoning trajectories contain reusable procedural knowledge that can be decomposed into self-contained subquestion-subroutine pairs without loss of utility.
    Invoked in the description of datastore construction and the claim that gains come from broad procedural coverage.
invented entities (1)
  • Reasoning Memory framework no independent evidence
    purpose: Retrieval-augmented generation system that stores and retrieves procedural subroutines for reasoning models.
    New named system introduced to organize the decomposition and retrieval pipeline.

pith-pipeline@v0.9.0 · 5558 in / 1511 out tokens · 51952 ms · 2026-05-13T22:01:42.848578+00:00 · methodology

