Recognition: no theorem link
Procedural Knowledge at Scale Improves Reasoning
Pith reviewed 2026-05-13 22:01 UTC · model grok-4.3
The pith
Decomposing reasoning trajectories into subquestion-subroutine pairs and retrieving them during inference boosts performance on math, science, and coding tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning Memory starts from existing corpora of step-by-step reasoning trajectories, decomposes each trajectory into self-contained subquestion-subroutine pairs to create a datastore of 32 million compact procedural knowledge entries, and at inference time uses a lightweight in-thought prompt to let the model verbalize the core subquestion, retrieve relevant subroutines, and reason under them as implicit procedural priors. Across six math, science, and coding benchmarks this consistently outperforms RAG with document, trajectory, or template knowledge and a compute-matched test-time scaling baseline, with higher inference budgets yielding up to 19.2 percent improvement over no retrieval and 7.9 percent over the strongest compute-matched baseline.
What carries the argument
Reasoning Memory: a retrieval-augmented generation framework that decomposes reasoning trajectories into subquestion-subroutine pairs to supply procedural knowledge at inference time.
If this is right
- Models gain from reusing how-to steps extracted from past solutions instead of solving every problem from scratch.
- Broad coverage of procedural patterns across large trajectory corpora drives larger gains than retrieval of facts or complete examples.
- Higher inference budgets amplify the benefit of retrieving and conditioning on procedural subroutines.
- The specific decomposition into subquestion-subroutine pairs enables cleaner extraction and safer reuse than whole-trajectory retrieval.
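The datastore-and-retrieval mechanism these bullets describe can be sketched in miniature. Everything below is illustrative: the entries, the `Entry`/`retrieve` names, and the toy lexical similarity are this review's stand-ins, whereas the paper's actual system uses an embedding retriever over 32 million entries.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    subquestion: str  # what a reasoning step was trying to answer
    subroutine: str   # the reusable how-to extracted for that step

def similarity(a: str, b: str) -> float:
    """Toy Jaccard word overlap; the paper's retriever is embedding-based."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def retrieve(datastore: list[Entry], subquestion: str, k: int = 2) -> list[Entry]:
    """Return the k entries whose subquestions best match the verbalized query."""
    return sorted(datastore, key=lambda e: similarity(e.subquestion, subquestion),
                  reverse=True)[:k]

# Hypothetical entries, as if decomposed from past trajectories.
datastore = [
    Entry("how to find the roots of a quadratic", "apply the quadratic formula ..."),
    Entry("how to check a candidate answer", "substitute it back into the equation ..."),
    Entry("how to sort a list stably", "use merge sort or a built-in stable sort ..."),
]

hits = retrieve(datastore, "find roots of a quadratic equation", k=1)
print(hits[0].subquestion)  # → how to find the roots of a quadratic
```

The point of the sketch is the indexing unit: retrieval keys on the subquestion, so a new problem only needs to verbalize what it is asking, not match a whole past solution.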
Where Pith is reading between the lines
- The same decomposition approach could be applied to planning or multi-step tool-use tasks where procedural reuse matters more than factual lookup.
- Models might close the loop by adding newly generated successful trajectories back into the datastore after each run.
- Scaling the size and diversity of the procedural datastore could offer a cheaper alternative to scaling model parameters for reasoning gains.
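The closed-loop idea in the second bullet is speculation by this review, not something the paper demonstrates. A minimal sketch of what it could look like, with a placeholder decomposer and invented function names:

```python
def update_datastore(datastore, trajectory, solved, decompose):
    """Speculative self-improvement loop: decompose each newly solved
    trajectory and append its subquestion-subroutine pairs. Failed runs
    are discarded so the store only accumulates verified procedures."""
    if solved:
        datastore.extend(decompose(trajectory))
    return datastore

# Placeholder decomposer: one entry per non-empty trajectory line.
decompose = lambda traj: [line for line in traj.splitlines() if line.strip()]

store = ["existing entry"]
store = update_datastore(store, "reframe as modular arithmetic\ncheck small cases",
                         solved=True, decompose=decompose)
print(len(store))  # → 3
```

Gating on `solved` matters: without a correctness filter, the loop would feed misleading subroutines back into future retrievals.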
Load-bearing premise
Decomposing existing reasoning trajectories into self-contained subquestion-subroutine pairs produces procedural knowledge that remains useful and non-misleading when retrieved and inserted into new reasoning traces.
What would settle it
A controlled test in which retrieved subroutines are deliberately force-inserted into the model's trace: the premise fails if accuracy on the same benchmarks then drops below the no-retrieval baseline.
Original abstract
Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reasoning Memory, a RAG framework that decomposes existing step-by-step reasoning trajectories into a datastore of 32 million self-contained subquestion-subroutine pairs. At inference time, a lightweight in-thought prompt enables the model to verbalize subquestions, retrieve relevant procedural knowledge, and reason under the retrieved subroutines as implicit priors. The authors report consistent outperformance over RAG variants using documents, full trajectories, or templates, as well as compute-matched test-time scaling baselines, with gains reaching 19.2% over no-retrieval and 7.9% over the strongest baseline across six math, science, and coding benchmarks. Ablations attribute the improvements to broad procedural coverage and the decomposition/retrieval design.
Significance. If the central empirical claims hold after addressing verification gaps, the work would establish that large-scale extraction and reuse of procedural knowledge from trajectories can deliver meaningful test-time gains in reasoning without retraining. The scale of the 32M-entry datastore and the consistent cross-domain improvements would position this as a practical complement to existing test-time scaling methods, potentially influencing how future systems store and retrieve reusable reasoning patterns.
major comments (2)
- [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.
- [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.
minor comments (2)
- [Abstract] The phrase 'lightweight in-thought prompt' is introduced without a brief definition or example; a short illustrative snippet would improve immediate clarity.
- [Method] The manuscript would benefit from an explicit statement of the total number of trajectories used to build the 32M-entry datastore and the decomposition algorithm's exact criteria for 'self-contained' pairs.
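On the first minor comment: the paper's exact prompt text is not reproduced here, so the following is purely an illustrative guess at what a mid-reasoning insertion could look like. The function name, wording, and hint format are this review's inventions.

```python
def in_thought_prompt(retrieved: list[str]) -> str:
    """Build a hypothetical mid-reasoning insertion: the model first
    verbalizes the core subquestion, then sees retrieved subroutines
    offered as soft hints rather than instructions."""
    hints = "\n".join(f"- {s}" for s in retrieved)
    return (
        "Before continuing, state the core subquestion you are solving.\n"
        "Related subroutines from past solutions (use only if helpful):\n"
        f"{hints}\n"
        "Now continue your reasoning."
    )

print(in_thought_prompt(["complete the square to expose the vertex"]))
```

The hedged phrasing ("use only if helpful") reflects the abstract's framing of retrieved subroutines as implicit priors rather than mandatory steps.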
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve experimental rigor and provide additional verification of the mechanism.
Point-by-point responses
- Referee: [Abstract and experimental results] The performance claims (19.2% and 7.9% lifts) are presented without error bars, statistical significance tests, or detailed baseline implementation descriptions (e.g., exact retrieval parameters or prompt lengths for the compute-matched scaling baseline). This makes it impossible to assess whether the reported gains are robust or could be explained by confounds such as increased context length or retrieval noise.
Authors: We agree that error bars, statistical significance tests, and fuller baseline details are needed to demonstrate robustness. In the revised version we will report standard deviations over multiple random seeds for all main results, include paired statistical tests (e.g., t-tests) against baselines, and expand the experimental setup and appendix with precise retrieval parameters (top-k, embedding model, similarity threshold), exact prompt lengths, and total token budgets for every baseline including the compute-matched test-time scaling condition. We will also add a controlled ablation that matches total context length across methods to rule out length-related confounds. revision: yes
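The paired tests the authors promise are straightforward to run from per-seed accuracies. A sketch with a hand-rolled paired t statistic; the seed-level scores below are made up for illustration, not taken from the paper.

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-seed accuracies (method a vs. method b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-seed accuracies: Reasoning Memory vs. compute-matched baseline.
rm   = [0.71, 0.69, 0.73, 0.70, 0.72]
base = [0.64, 0.66, 0.65, 0.63, 0.67]
t = paired_t(rm, base)
print(round(t, 2))  # → 6.71
```

In practice one would use a library routine (e.g., a paired t-test with its p-value) and report the degrees of freedom alongside the statistic; the point is only that the revision's claim is cheap to verify once seeds are reported.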
- Referee: [Experimental results] No per-retrieval or per-insertion analysis is provided to show that retrieved subroutines are actually used by the model, remain non-misleading, or transfer usefully to new problems. Without such evidence (e.g., manual inspection of traces or metrics on subroutine relevance/usage), the attribution of gains specifically to procedural knowledge reuse rather than prompt diversity remains unverified and load-bearing for the central claim.
Authors: We acknowledge the value of direct evidence that the model actually consults and benefits from the retrieved subroutines. We will add a new analysis subsection containing (1) manual inspection of 100 randomly sampled reasoning traces with counts of explicit references to retrieved subroutines, (2) a quantitative relevance metric (semantic similarity between verbalized subquestion and retrieved subroutine), and (3) an ablation measuring performance drop when retrieved items are replaced by random or irrelevant subroutines. These additions will help confirm that gains arise from procedural reuse rather than prompt diversity alone. Our existing decomposition ablations already isolate the design contribution, but the requested per-retrieval diagnostics will strengthen the mechanistic claim. revision: yes
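The relevance metric the rebuttal proposes (semantic similarity between the verbalized subquestion and the retrieved subroutine) can be illustrated with a bag-of-words cosine as a stand-in for embedding similarity. The example sentences are invented; real diagnostics would use the paper's embedding model.

```python
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine; a crude stand-in for embedding similarity
    between a verbalized subquestion and a retrieved subroutine."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

relevant   = cosine("find the roots of the quadratic", "factor the quadratic to find roots")
irrelevant = cosine("find the roots of the quadratic", "parse the json response body")
print(relevant > irrelevant)  # → True
```

The proposed random-replacement ablation then follows directly: swap each retrieved subroutine for a random datastore entry and check that the score gap collapses, which would tie the gains to relevance rather than prompt diversity.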
Circularity Check
No significant circularity in empirical RAG framework
full rationale
The paper describes an empirical method: decompose existing reasoning trajectories into subquestion-subroutine pairs to build a 32M-entry datastore, then retrieve and insert them via in-thought prompts at inference time. Performance claims (up to 19.2% over no-retrieval, 7.9% over compute-matched baselines) rest on benchmark evaluations and ablations attributing gains to procedural coverage plus decomposition/retrieval design. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The datastore is constructed from prior corpora and evaluated on separate benchmarks, keeping the derivation self-contained against external results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reasoning trajectories contain reusable procedural knowledge that can be decomposed into self-contained subquestion-subroutine pairs without loss of utility.
invented entities (1)
- Reasoning Memory framework (no independent evidence)