pith. machine review for the scientific record.

arxiv: 2604.01348 · v2 · submitted 2026-04-01 · 💻 cs.CL

Recognition: no theorem link

Procedural Knowledge at Scale Improves Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning memory · procedural knowledge · retrieval augmented generation · test-time scaling · subquestion decomposition · math benchmarks · coding tasks · language model reasoning

The pith

Decomposing reasoning trajectories into subquestion-subroutine pairs and retrieving them during inference boosts performance on math, science, and coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models can improve at hard reasoning by systematically reusing pieces of how-to knowledge extracted from earlier solutions rather than treating each new problem in isolation. Existing step-by-step trajectories are broken into compact, self-contained subquestion-subroutine pairs that form a large datastore. At test time a lightweight prompt lets the model surface the current subquestion, pull relevant subroutines, and continue reasoning with those pieces as guides. This Reasoning Memory approach outperforms retrieval of full documents, full trajectories, or templates as well as a compute-matched scaling baseline, reaching gains of up to 19.2 percent over no retrieval and 7.9 percent over the strongest baseline. The improvements trace to the breadth of procedural coverage and the decomposition-retrieval design that lets the model draw on diverse prior ways of reframing, approaching, and verifying problems.
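To make the pipeline concrete, here is a minimal sketch of the datastore-construction step, assuming a generic `llm` callable; the prompt wording, `ProceduralEntry`, and function names are illustrative, not the paper's artifacts.

```python
import json
from dataclasses import dataclass

@dataclass
class ProceduralEntry:
    subquestion: str  # what is being asked at this step
    subroutine: str   # the reusable procedure that answers it

# Hypothetical decomposition prompt; the paper's actual prompt is not given here.
DECOMPOSE_PROMPT = (
    "Split the reasoning trajectory below into self-contained steps. "
    "Return a JSON list of objects with keys 'subquestion' and 'subroutine'.\n\n"
    "Trajectory:\n{trajectory}"
)

def decompose_trajectory(llm, trajectory: str) -> list[ProceduralEntry]:
    """Ask an LLM to split one trajectory into subquestion-subroutine pairs."""
    raw = llm(DECOMPOSE_PROMPT.format(trajectory=trajectory))
    return [ProceduralEntry(**item) for item in json.loads(raw)]

def build_datastore(llm, trajectories: list[str]) -> list[ProceduralEntry]:
    """Aggregate pairs across all source trajectories (the paper reports ~32M entries)."""
    store: list[ProceduralEntry] = []
    for t in trajectories:
        store.extend(decompose_trajectory(llm, t))
    return store
```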

Core claim

Reasoning Memory starts from existing corpora of step-by-step reasoning trajectories, decomposes each trajectory into self-contained subquestion-subroutine pairs to create a datastore of 32 million compact procedural knowledge entries, and at inference time uses a lightweight in-thought prompt to let the model verbalize the core subquestion, retrieve relevant subroutines, and reason under them as implicit procedural priors. Across six math, science, and coding benchmarks this consistently outperforms RAG with document, trajectory, or template knowledge and a compute-matched test-time scaling baseline, with higher inference budgets yielding up to 19.2 percent improvement over no retrieval and 7.9 percent over the strongest compute-matched baseline.
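A companion sketch of the inference-time loop under the same caveats: `generate` and `retrieve_top_k` are hypothetical stand-ins for the model call and the datastore search, and the in-thought prompt wording is invented.

```python
# Illustrative in-thought prompt; the paper's actual wording is not reproduced here.
IN_THOUGHT_PROMPT = "State the core subquestion you are solving right now, in one line."

def reasoning_memory_step(generate, retrieve_top_k, trace: str, k: int = 3) -> str:
    # 1. Let the model verbalize its current subquestion within the trace.
    subquestion = generate(trace + "\n" + IN_THOUGHT_PROMPT).strip()
    # 2. Retrieve the k most relevant subroutines from the procedural datastore.
    subroutines = retrieve_top_k(subquestion, k)
    # 3. Continue reasoning with the retrieved procedures as implicit priors.
    hints = "\n".join(f"- {s}" for s in subroutines)
    return generate(trace + "\nRelevant procedures:\n" + hints + "\nContinue reasoning.")
```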

What carries the argument

Reasoning Memory: a retrieval-augmented generation framework that decomposes reasoning trajectories into subquestion-subroutine pairs to supply procedural knowledge at inference time.

If this is right

  • Models gain from reusing how-to steps extracted from past solutions instead of solving every problem from scratch.
  • Broad coverage of procedural patterns across large trajectory corpora drives larger gains than retrieval of facts or complete examples.
  • Higher inference budgets amplify the benefit of retrieving and conditioning on procedural subroutines.
  • The specific decomposition into subquestion-subroutine pairs enables cleaner extraction and safer reuse than whole-trajectory retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition approach could be applied to planning or multi-step tool-use tasks where procedural reuse matters more than factual lookup.
  • Models might close the loop by adding newly generated successful trajectories back into the datastore after each run; a minimal sketch of such a loop follows this list.
  • Scaling the size and diversity of the procedural datastore could offer a cheaper alternative to scaling model parameters for reasoning gains.
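As flagged in the second bullet, here is a minimal sketch of that hypothetical closed loop; `solve`, `is_correct`, and `decompose` are invented stand-ins, and nothing here is claimed by the paper.

```python
def self_augment(datastore, problems, solve, is_correct, decompose):
    """Fold verified successful traces back into the procedural datastore."""
    for problem in problems:
        trace = solve(problem)          # run Reasoning Memory inference
        if is_correct(problem, trace):  # keep only verified successes
            datastore.extend(decompose(trace))  # reuse the decomposition step
    return datastore
```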

Load-bearing premise

Decomposing existing reasoning trajectories into self-contained subquestion-subroutine pairs produces procedural knowledge that remains useful and non-misleading when retrieved and inserted into new reasoning traces.

What would settle it

A controlled test that force-inserts the retrieved subroutines into the model's trace: if accuracy on the same benchmarks then drops below the no-retrieval baseline, the load-bearing premise fails.
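A minimal sketch of that comparison, with `solve_with_memory`, `solve_plain`, and `is_correct` as hypothetical stand-ins for the two pipelines and the grader:

```python
def premise_holds(benchmark, solve_with_memory, solve_plain, is_correct) -> bool:
    """Premise survives only if force-inserted subroutines do not hurt accuracy."""
    with_mem = sum(is_correct(p, solve_with_memory(p)) for p in benchmark)
    plain = sum(is_correct(p, solve_plain(p)) for p in benchmark)
    return with_mem >= plain
```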

read the original abstract

Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reasoning Memory, a RAG framework that decomposes existing step-by-step reasoning trajectories into a datastore of 32 million self-contained subquestion-subroutine pairs. At inference time, a lightweight in-thought prompt enables the model to verbalize subquestions, retrieve relevant procedural knowledge, and reason under the retrieved subroutines as implicit priors. The authors report consistent outperformance over RAG variants using documents, full trajectories, or templates, as well as compute-matched test-time scaling baselines, with gains reaching 19.2% over no-retrieval and 7.9% over the strongest baseline across six math, science, and coding benchmarks. Ablations attribute the improvements to broad procedural coverage and the decomposition/retrieval design.

Significance. If the central empirical claims hold after addressing verification gaps, the work would establish that large-scale extraction and reuse of procedural knowledge from trajectories can deliver meaningful test-time gains in reasoning without retraining. The scale of the 32M-entry datastore and the consistent cross-domain improvements would position this as a practical complement to existing test-time scaling methods, potentially influencing how future systems store and retrieve reusable reasoning patterns.

major comments (2)
  1. [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.
  2. [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.
minor comments (2)
  1. [Abstract] The phrase 'lightweight in-thought prompt' is introduced without a brief definition or example; a short illustrative snippet would improve immediate clarity.
  2. [Method] The manuscript would benefit from an explicit statement of the total number of trajectories used to build the 32M-entry datastore and the decomposition algorithm's exact criteria for 'self-contained' pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve experimental rigor and provide additional verification of the mechanism.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.

    Authors: We agree that error bars, statistical significance tests, and fuller baseline details are needed to demonstrate robustness. In the revised version we will report standard deviations over multiple random seeds for all main results, include paired statistical tests (e.g., t-tests; a sketch of such a test follows these responses) against baselines, and expand the experimental setup and appendix with precise retrieval parameters (top-k, embedding model, similarity threshold), exact prompt lengths, and total token budgets for every baseline including the compute-matched test-time scaling condition. We will also add a controlled ablation that matches total context length across methods to rule out length-related confounds. revision: yes

  2. Referee: [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.

    Authors: We acknowledge the value of direct evidence that the model actually consults and benefits from the retrieved subroutines. We will add a new analysis subsection containing (1) manual inspection of 100 randomly sampled reasoning traces with counts of explicit references to retrieved subroutines, (2) a quantitative relevance metric (semantic similarity between verbalized subquestion and retrieved subroutine), and (3) an ablation measuring performance drop when retrieved items are replaced by random or irrelevant subroutines. These additions will help confirm that gains arise from procedural reuse rather than prompt diversity alone. Our existing decomposition ablations already isolate the design contribution, but the requested per-retrieval diagnostics will strengthen the mechanistic claim. revision: yes
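For concreteness, minimal sketches of two of the analyses promised above. First, the paired significance test from response 1, assuming accuracy is recorded per random seed (the seed protocol is an assumption, not the authors' stated setup):

```python
import numpy as np
from scipy import stats

def report_with_significance(method_acc_by_seed, baseline_acc_by_seed):
    """Mean and std over seeds, plus a paired t-test of method vs. baseline."""
    m = np.asarray(method_acc_by_seed, dtype=float)
    b = np.asarray(baseline_acc_by_seed, dtype=float)
    t, p = stats.ttest_rel(m, b)  # paired across matching seeds
    return {
        "method": (m.mean(), m.std(ddof=1)),
        "baseline": (b.mean(), b.std(ddof=1)),
        "t_statistic": float(t),
        "p_value": float(p),
    }
```

Second, the relevance metric from response 2, with `embed` as a placeholder for whatever sentence-embedding model the authors choose:

```python
import numpy as np

def retrieval_relevance(embed, subquestion: str, subroutines: list[str]) -> float:
    """Mean cosine similarity between the subquestion and retrieved subroutines."""
    q = np.asarray(embed(subquestion), dtype=float)
    sims = []
    for s in subroutines:
        v = np.asarray(embed(s), dtype=float)
        sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    return float(np.mean(sims))
```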

Circularity Check

0 steps flagged

No significant circularity in empirical RAG framework

full rationale

The paper describes an empirical method: decompose existing reasoning trajectories into subquestion-subroutine pairs to build a 32M-entry datastore, then retrieve and insert them via in-thought prompts at inference time. Performance claims (up to 19.2% over no-retrieval, 7.9% over compute-matched baselines) rest on benchmark evaluations and ablations attributing gains to procedural coverage plus decomposition/retrieval design. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The datastore is constructed from prior corpora and evaluated on separate benchmarks, keeping datastore construction independent of the evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that procedural knowledge in reasoning trajectories can be usefully decomposed into independent subquestion-subroutine pairs and that retrieval of those pairs provides non-harmful guidance during new inference. No free parameters are explicitly fitted in the abstract description, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Reasoning trajectories contain reusable procedural knowledge that can be decomposed into self-contained subquestion-subroutine pairs without loss of utility.
    Invoked in the description of datastore construction and the claim that gains come from broad procedural coverage.
invented entities (1)
  • Reasoning Memory framework no independent evidence
    purpose: Retrieval-augmented generation system that stores and retrieves procedural subroutines for reasoning models.
    New named system introduced to organize the decomposition and retrieval pipeline.

pith-pipeline@v0.9.0 · 5558 in / 1511 out tokens · 51952 ms · 2026-05-13T22:01:42.848578+00:00 · methodology

