RAG over Thinking Traces Can Improve Reasoning Tasks
Pith reviewed 2026-05-07 14:29 UTC · model grok-4.3
The pith
Retrieving thinking traces from prior problem-solving attempts improves reasoning performance on math, code, and science benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Retrieving from a corpus of thinking traces generated during problem-solving attempts enables a simple retrieve-then-generate pipeline to improve reasoning performance across benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. The approach outperforms both non-RAG methods and retrieval over web documents, with further gains from the T3 offline transformation that produces structured representations of the traces. Relative improvements reach 56.3 percent on AIME when Gemini-2.5-Flash answers using traces generated by Gemini-2-thinking, and the method works even when the trace-generating model differs from the model answering new queries.
What carries the argument
Thinking traces, the intermediate reasoning trajectories produced while attempting to solve problems, serve directly as the retrieval corpus in a RAG pipeline. The T3 offline method strengthens this by converting raw traces into structured, compact, and diagnostic representations for better matching and usability.
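As a concrete illustration, here is a minimal sketch of such a retrieve-then-generate pipeline. The paper's retriever is not specified in this summary, so TF-IDF cosine similarity stands in for whatever retriever is actually used, and `generate` is a hypothetical placeholder for the answering model's API.

```python
# Minimal retrieve-then-generate sketch over a corpus of thinking traces.
# TF-IDF is an illustrative stand-in for the paper's retriever, and
# `generate` is a hypothetical LLM call; both are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the answering model's API."""
    raise NotImplementedError

def retrieve_then_generate(query: str, traces: list[str], k: int = 3) -> str:
    vectorizer = TfidfVectorizer()
    corpus_matrix = vectorizer.fit_transform(traces)         # (n_traces, vocab)
    query_vec = vectorizer.transform([query])                # (1, vocab)
    scores = cosine_similarity(query_vec, corpus_matrix)[0]  # (n_traces,)
    top_k = scores.argsort()[::-1][:k]                       # best-matching traces
    context = "\n\n---\n\n".join(traces[i] for i in top_k)
    prompt = (
        "Reasoning traces from related problems:\n"
        f"{context}\n\n"
        f"New problem:\n{query}\n"
        "Solve it, reusing any strategies from the traces that apply."
    )
    return generate(prompt)
```

The only change from ordinary document RAG is the corpus: retrieved items are reasoning trajectories rather than web pages, so the prompt asks the model to reuse strategies instead of citing facts.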
If this is right
- RAG over thinking traces outperforms retrieval from standard web corpora on reasoning benchmarks.
- Gains persist even when traces come from an earlier model and are applied to more recent models.
- The T3-structured version of the corpus can reduce inference cost by up to 15 percent while raising accuracy.
- Consistent improvements appear across model scales and across math, code, and science benchmarks.
Where Pith is reading between the lines
- If traces carry reusable reasoning patterns, then curating large shared libraries of high-quality thinking traces could become a practical way to augment future models without retraining.
- The results suggest that external retrieval of intermediate steps may sometimes be cheaper and more reliable than forcing a model to regenerate every reasoning path from scratch.
- This technique could be tested on domains outside math and code, such as scientific hypothesis generation, where intermediate reasoning steps are also recorded.
Load-bearing premise
Thinking traces generated during problem-solving attempts contain generalizable, high-quality reasoning signals that transfer usefully to new problems and different models without introducing systematic errors or biases from the trace-generation process itself.
What would settle it
Running the same retrieve-then-generate experiments on AIME or LiveCodeBench but replacing the thinking-traces corpus with either randomly ordered traces or traces drawn from unrelated problem domains, and observing whether performance gains disappear or reverse.
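A sketch of that settling experiment, reusing `retrieve_then_generate` from the earlier sketch. `score` is a hypothetical per-problem correctness oracle, and "randomly ordered" is read here as shuffling the step order inside each trace; both readings are assumptions.

```python
# Corpus ablation: rerun the same pipeline under three conditions and
# compare mean accuracy. If gains survive shuffled or unrelated traces,
# they are not coming from transferable reasoning content.
import random

def score(problem: str, answer: str) -> float:
    """Hypothetical correctness check (e.g., exact match on AIME answers)."""
    raise NotImplementedError

def shuffle_steps(traces: list[str], seed: int = 0) -> list[str]:
    """Keep each trace's content but destroy its internal step order."""
    rng = random.Random(seed)
    shuffled = []
    for trace in traces:
        lines = trace.splitlines()
        rng.shuffle(lines)
        shuffled.append("\n".join(lines))
    return shuffled

def run_ablation(problems: list[str], traces: list[str],
                 unrelated: list[str]) -> dict[str, float]:
    conditions = {
        "original": traces,
        "shuffled": shuffle_steps(traces),
        "unrelated": unrelated,  # traces from a different problem domain
    }
    return {
        name: sum(score(p, retrieve_then_generate(p, corpus)) for p in problems)
              / len(problems)
        for name, corpus in conditions.items()
    }
```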
Original abstract
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Interestingly, RAG on T3 also incurs little or no extra inference cost, and can even reduce inference cost by up to 15%. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that retrieval-augmented generation (RAG) over thinking traces—intermediate reasoning trajectories generated during problem-solving attempts—can substantially improve performance on reasoning-intensive tasks such as mathematics and code generation, contrary to the common view that RAG is ineffective for such problems. The authors introduce T3, an offline method to convert raw traces into structured, retrieval-friendly representations, and report that a simple retrieve-then-generate pipeline using these traces as the corpus yields consistent gains over non-RAG baselines and standard web-document RAG on benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. Specific results include relative improvements of +56.3% on AIME for Gemini-2.5-Flash, +8.6% for GPT-OSS-120B, and +7.6% for GPT-5 when using traces from Gemini-2-thinking, with little or no added inference cost and sometimes up to 15% cost reduction. Code is released at the provided GitHub repository.
Significance. If the results hold without data leakage or other confounds, the work is significant because it provides concrete empirical evidence that the choice of corpus, rather than RAG itself, limits its utility on reasoning tasks. Demonstrating that thinking traces contain transferable, high-quality reasoning signals that can be retrieved to augment even strong models without extra cost could influence how reasoning systems are built, shifting focus toward curating and structuring intermediate reasoning data. The release of code supports reproducibility and further exploration in the IR and LLM reasoning communities.
Major comments (2)
- [Abstract and §3] Trace corpus construction: The headline empirical claim—that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation—requires that the trace corpus was built exclusively from problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).
- [§4] Experimental results: The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.
Minor comments (2)
- [Abstract] The T3 transformation is described only at a high level in the abstract; concise pseudocode or a diagram in the main text would improve clarity on how raw traces become structured representations (an illustrative sketch follows this list).
- The manuscript could add a short related-work paragraph contrasting this approach with prior uses of chain-of-thought or self-generated data for retrieval or distillation.
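Since T3 is described only at a high level, here is a purely illustrative sketch of what a structured, compact, and diagnostic trace representation might look like. Every field name and the `summarize` helper below are assumptions, not the authors' actual schema.

```python
# Illustrative only: the paper describes T3's output just as "structured,
# compact, and diagnostic", so these fields are guesses at what such a
# representation could contain.
from dataclasses import dataclass

@dataclass
class StructuredTrace:
    problem_summary: str   # what kind of problem the trace tackled
    key_steps: list[str]   # the reasoning moves that carried the solution
    pitfalls: list[str]    # dead ends noted along the way (the diagnostic part)
    final_answer: str

def t3_transform(raw_trace: str, summarize) -> StructuredTrace:
    """One offline pass over a raw trace. `summarize` is a hypothetical LLM
    call expected to return a dict keyed by the four field names above."""
    fields = summarize(
        "Condense this reasoning trace into four parts: problem_summary, "
        "key_steps, pitfalls, final_answer.\n\n" + raw_trace
    )
    return StructuredTrace(**fields)
```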
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have incorporated revisions to improve clarity and completeness of the manuscript.
Point-by-point responses
- Referee: [Abstract and §3] Trace corpus construction: The headline empirical claim—that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation—requires that the trace corpus was built exclusively from problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).
Authors: We agree this is a crucial clarification for interpreting the results as reasoning transfer rather than leakage. The thinking traces were generated exclusively from problem-solving attempts on problems drawn from the training splits of MATH, GSM8K, and other sources that have no overlap with AIME 2025-2026, LiveCodeBench, or GPQA-Diamond; we verified this by checking problem IDs and content hashes. We have revised §3 to explicitly document the corpus sources, the disjointness verification procedure, and a statement confirming zero overlap with the evaluation sets. This addition directly addresses the concern without altering any experimental results.
Revision: yes
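For concreteness, a minimal sketch of the hash-based disjointness check described in this response; the normalization recipe is an assumption rather than the authors' documented procedure.

```python
# Leakage check: hash lightly normalized problem statements and assert the
# trace corpus and test sets share no entries.
import hashlib
import re

def content_hash(problem: str) -> str:
    """Hash a whitespace- and case-normalized problem statement."""
    normalized = re.sub(r"\s+", " ", problem.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def assert_disjoint(corpus_problems: list[str], test_problems: list[str]) -> None:
    """Fail loudly if any test problem also appears in the trace corpus."""
    overlap = {content_hash(p) for p in corpus_problems} & {
        content_hash(p) for p in test_problems
    }
    if overlap:
        raise ValueError(f"{len(overlap)} test problem(s) found in the corpus")
```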
- Referee: [§4] Experimental results: The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.
Authors: We acknowledge that absolute numbers and statistical details improve interpretability. In the revised manuscript we have updated all tables in §4 to report absolute accuracies alongside the relative gains, included standard deviations computed over 5 independent runs for each condition, and added paired statistical significance tests (McNemar’s test for binary correctness and t-tests on accuracy) with p-values. These changes allow readers to evaluate robustness directly while preserving the original claims.
Revision: yes
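For concreteness, exact McNemar's test on paired per-problem correctness reduces to a two-sided binomial test on the discordant pairs; a minimal sketch follows, with illustrative function names.

```python
# Exact McNemar's test: under H0, the two systems' discordant outcomes
# split 50/50, so the p-value is a two-sided binomial test on those pairs.
from scipy.stats import binomtest

def mcnemar_exact(baseline_correct: list[bool], rag_correct: list[bool]) -> float:
    """p-value for H0: baseline and RAG have equal per-problem error rates."""
    b = sum(x and not y for x, y in zip(baseline_correct, rag_correct))  # baseline only
    c = sum(y and not x for x, y in zip(baseline_correct, rag_correct))  # RAG only
    if b + c == 0:
        return 1.0  # the systems never disagree: no evidence either way
    return binomtest(b, b + c, p=0.5).pvalue
```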
Circularity Check
No circularity; purely empirical evaluation with independent benchmark comparisons.
Full rationale
The paper advances an empirical claim that RAG over thinking traces improves reasoning performance. It describes generating traces from problem-solving attempts, applying an offline T3 transformation, and running retrieve-then-generate experiments on public benchmarks (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). All reported gains are direct outcome measurements against non-RAG and web-corpus baselines; no equations, fitted parameters, or derivations appear that could reduce to self-definition or prior self-citations. The methodology is externally replicable via the released code and stated benchmarks, satisfying the criteria for a self-contained, non-circular result.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Thinking traces generated by LLMs during problem solving contain useful, retrievable signals for improving reasoning on new problems.