RAG over Thinking Traces Can Improve Reasoning Tasks

Matei Zaharia; Negar Arabzadeh; Sewon Min; Wenjie Ma

arXiv: 2605.03344 · v2 · pith:BNMDKKCEnew · submitted 2026-05-05 · 💻 cs.IR · cs.AI · cs.CL

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

Pith reviewed 2026-05-07 14:29 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL

keywords retrieval-augmented generation · thinking traces · reasoning tasks · math benchmarks · code generation · RAG · intermediate reasoning · T3 method

The pith

Retrieving thinking traces from problem-solving attempts improves reasoning performance on math and code benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the common belief that retrieval-augmented generation provides limited value for reasoning-heavy tasks like mathematics and code generation. It shows that the key limitation is the choice of corpus rather than RAG itself, and demonstrates that intermediate thinking trajectories generated while attempting problems serve as an effective retrieval source. A retrieve-then-generate approach using these traces produces consistent gains across strong models on benchmarks such as AIME, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval from standard web corpora. The authors further introduce an offline method called T3 that restructures the traces into more retrieval-friendly forms, unlocking additional performance and sometimes lower inference costs.

Core claim

Retrieving from a corpus of thinking traces generated during problem-solving attempts enables a simple retrieve-then-generate pipeline to improve reasoning performance across benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. The approach outperforms both non-RAG methods and retrieval over web documents, with further gains from the T3 offline transformation that produces structured representations of the traces. The relative improvement reaches 56.3 percent on AIME 2025-2026 when Gemini-2.5-Flash answers with retrieved traces generated by Gemini-2-thinking, and the method works even when the trace-generating model differs from the one answering new queries.

What carries the argument

The argument rests on two pieces. The first is thinking traces, the intermediate reasoning trajectories produced while attempting to solve problems, used directly as the retrieval corpus in a RAG pipeline. The second is T3, an offline method that converts raw traces into structured, compact, and diagnostic representations that are easier to match and to use.
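A minimal sketch of the retrieve-then-generate loop described here may help fix ideas. It is not the authors' implementation: the token-overlap retriever is a stand-in chosen only because it needs no dependencies, the `generate` callable is a hypothetical hook for whatever model answers the query, and all function names are illustrative. The prompt wording follows the RAG inference instruction quoted in the figure list of this page.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Rank traces by Jaccard token overlap with the query (assumed stand-in retriever)."""
    q = tokenize(query)

    def score(doc: str) -> float:
        d = tokenize(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def build_prompt(query: str, examples: List[str]) -> str:
    """Assemble the prompt: instruction, numbered retrieved examples, then the main problem."""
    lines = [
        "Solve the main problem by using useful hints and strategies "
        "from the retrieved examples."
    ]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}: {example}")
    lines.append(f"Main problem: {query}")
    return "\n".join(lines)


def retrieve_then_generate(
    query: str, corpus: List[str], generate: Callable[[str], str], k: int = 3
) -> str:
    """Retrieve k traces, build the prompt, and hand it to the (hypothetical) generator."""
    return generate(build_prompt(query, retrieve(query, corpus, k)))
```

Passing the generator in as a callable keeps the sketch independent of any particular model API; a dense or BM25 retriever could replace `retrieve` without touching the rest.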

If this is right

RAG over thinking traces outperforms retrieval from standard web corpora on reasoning benchmarks.
Gains persist even when traces come from an earlier model and are applied to more recent models.
The T3-structured version of the corpus can reduce inference cost by up to 15 percent while raising accuracy.
Consistent improvements appear across model scales and across math, code, and science benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If traces carry reusable reasoning patterns, then curating large shared libraries of high-quality thinking traces could become a practical way to augment future models without retraining.
The results suggest that external retrieval of intermediate steps may sometimes be cheaper and more reliable than forcing a model to regenerate every reasoning path from scratch.
This technique could be tested on domains outside math and code, such as scientific hypothesis generation, where intermediate reasoning steps are also recorded.

Load-bearing premise

Thinking traces generated during problem-solving attempts contain generalizable, high-quality reasoning signals that transfer usefully to new problems and different models without introducing systematic errors or biases from the trace-generation process itself.

What would settle it

Rerun the same retrieve-then-generate experiments on AIME or LiveCodeBench, but replace the thinking-traces corpus with either randomly ordered traces or traces drawn from unrelated problem domains, then check whether the performance gains disappear or reverse.
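The two control conditions could be built roughly as follows. This is a sketch under stated assumptions: the record layout (`domain`, `text`) is hypothetical, and "randomly ordered" is read here as ignoring the query and sampling traces uniformly, which is only one possible interpretation of the proposed test.

```python
import random
from typing import Dict, List

# Hypothetical trace record: {"domain": "...", "text": "..."}
Trace = Dict[str, str]


def random_retrieval_control(corpus: List[Trace], k: int, seed: int = 0) -> List[Trace]:
    """Ignore the query and sample up to k traces uniformly at random (assumed reading)."""
    rng = random.Random(seed)  # local RNG so the control is reproducible
    return rng.sample(corpus, min(k, len(corpus)))


def unrelated_domain_control(corpus: List[Trace], query_domain: str) -> List[Trace]:
    """Keep only traces whose domain differs from the query's domain."""
    return [trace for trace in corpus if trace["domain"] != query_domain]
```

Either control can then be fed to the same generation step as the real retrieval results, so that only the relevance of the retrieved traces changes between conditions.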

Figures

Figures reproduced from arXiv: 2605.03344 by Matei Zaharia, Negar Arabzadeh, Sewon Min, Wenjie Ma.

**Figure 1.** Overview of T3. Offline, a large reasoning model (e.g., Gemini-2-thinking) solves a set of problems and produces raw thinking traces. A smaller model (e.g., Gemini-2-FlashLite) then rewrites them into structured representations, forming a retrieval-friendly corpus. At inference time, a previously unseen query, which is not part of initial problem set, is retrieved against this corpus, and the retrieved …

**Figure 2.** A case study of T3-Reflect. Without retrieval, Gemini-2.5-Flash fails to reach a correct answer in 8 attempts. Retrieval over full traces is also insufficient and does not lead to a correct solution. In contrast, retrieval over T3 provides targeted reasoning guidance that enables the model to solve 7 out of 8 attempts correctly. Retrieved examples and solutions are shortened for brevity. Our comments …

**Figure 3.** Average cost–accuracy trade-off over three reasoning benchmarks (AIME 2025- …

**Figure 4.** … 5 and 6, respectively. Additionally, we provide our simple RAG inference prompt in …

**Figure 5.** Prompt for Semantic transformation. T3-Reflection Instruction. Extract failure patterns and negative knowledge from the reasoning trace. Guidelines. • Focus on common mistakes and misleading reasoning paths. • Explain why these mistakes are tempting. • Highlight how to detect and avoid them. • Provide contrast with the correct approach. • Do not reproduce the full solution. Output format. Problem: ... C…

**Figure 6.** Prompt for Reflect transformation. RAG Inference Instruction. Solve the main problem by using useful hints and strategies from the retrieved examples. Example 1: ... Example 2: ... Example 3: ... Main problem:

**Figure 7.** Prompt for RAG inference using retrieved examples.

**Figure 8.** Corpus statistics for thinking traces. (Left) Domain distribution of the two …

**Figure 9.** Impact of the number of retrieved documents ( …

**Figure 10.** A single reasoning trace transformed by each strategy, with token counts. All …

**Figure 11.** Example of generated and transformed thinking traces from a physics problem.

**Figure 12.** Example of generated and transformed thinking traces from a coding / optimiza…
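The offline stage summarized in the Figure 1 caption (a large model produces raw traces, a smaller model rewrites them) might be sketched as below. This is an illustration under assumptions, not the released code: `solve` and `rewrite` are hypothetical callables standing in for the two models, `CorpusEntry` and the way the prompt is laid out are invented for the example, and only the instruction text is taken from the Reflection prompt quoted in the figure list. The truncated output-format section of that prompt is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Instruction text as quoted in the figure list; the output-format part is
# truncated in the source and therefore omitted here.
REFLECT_INSTRUCTION = (
    "Extract failure patterns and negative knowledge from the reasoning trace.\n"
    "Guidelines.\n"
    "- Focus on common mistakes and misleading reasoning paths.\n"
    "- Explain why these mistakes are tempting.\n"
    "- Highlight how to detect and avoid them.\n"
    "- Provide contrast with the correct approach.\n"
    "- Do not reproduce the full solution."
)


@dataclass
class CorpusEntry:
    """One retrievable record (hypothetical layout)."""
    problem: str
    raw_trace: str
    transformed: str


def build_t3_corpus(
    problems: Iterable[str],
    solve: Callable[[str], str],    # stand-in for the large reasoning model
    rewrite: Callable[[str], str],  # stand-in for the smaller rewriting model
) -> List[CorpusEntry]:
    """Generate a raw trace per problem, then rewrite it into a structured form."""
    corpus: List[CorpusEntry] = []
    for problem in problems:
        raw = solve(problem)
        # Assumed prompt layout: instruction, then the problem, then the trace.
        prompt = f"{REFLECT_INSTRUCTION}\n\nProblem: {problem}\n\nReasoning trace:\n{raw}"
        corpus.append(CorpusEntry(problem, raw, rewrite(prompt)))
    return corpus
```

Because the transformation runs once offline, its cost is paid per corpus entry rather than per query, which is consistent with the claim that the structured corpus adds little inference-time overhead.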

Original abstract

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that retrieval-augmented generation (RAG) over thinking traces—intermediate reasoning trajectories generated during problem-solving attempts—can substantially improve performance on reasoning-intensive tasks such as mathematics and code generation, contrary to the common view that RAG is ineffective for such problems. The authors introduce T3, an offline method to convert raw traces into structured, retrieval-friendly representations, and report that a simple retrieve-then-generate pipeline using these traces as the corpus yields consistent gains over non-RAG baselines and standard web-document RAG on benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. Specific results include relative improvements of +56.3% on AIME for Gemini-2.5-Flash, +8.6% for GPT-OSS-120B, and +7.6% for GPT-5 when using traces from Gemini-2-thinking, with little or no added inference cost and sometimes up to 15% cost reduction. Code is released at the provided GitHub repository.

Significance. If the results hold without data leakage or other confounds, the work is significant because it provides concrete empirical evidence that the choice of corpus, rather than RAG itself, limits its utility on reasoning tasks. Demonstrating that thinking traces contain transferable, high-quality reasoning signals that can be retrieved to augment even strong models without extra cost could influence how reasoning systems are built, shifting focus toward curating and structuring intermediate reasoning data. The release of code supports reproducibility and further exploration in the IR and LLM reasoning communities.

major comments (2)

[Abstract and §3] (trace corpus construction): The headline empirical claim, that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation, requires that the trace corpus was built exclusively on problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).
[§4] (experimental results): The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.
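A small worked example shows why relative gains alone are hard to interpret. The accuracies below are hypothetical and are not taken from the paper; the function name is illustrative. The same absolute improvement of 10 points yields very different relative gains depending on the baseline.

```python
def relative_gain(baseline_acc: float, augmented_acc: float) -> float:
    """Relative improvement in percent: 100 * (augmented - baseline) / baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (augmented_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies in percent, chosen only for illustration.
weak_model = relative_gain(20.0, 30.0)    # +10 points absolute -> 50.0% relative
strong_model = relative_gain(80.0, 90.0)  # +10 points absolute -> 12.5% relative
```

Reporting the baseline and augmented accuracies next to the relative figure removes this ambiguity.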

minor comments (2)

[Abstract] The T3 transformation is described only at a high level in the abstract; a concise pseudocode listing or a diagram in the main text would clarify how raw traces are turned into structured representations.
The manuscript could add a short related-work paragraph contrasting this approach with prior uses of chain-of-thought or self-generated data for retrieval or distillation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have incorporated revisions to improve the clarity and completeness of the manuscript.

Point-by-point responses

Referee: [Abstract and §3] (trace corpus construction): The headline empirical claim, that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation, requires that the trace corpus was built exclusively on problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).

Authors: We agree this is a crucial clarification for interpreting the results as reasoning transfer rather than leakage. The thinking traces were generated exclusively from problem-solving attempts on problems drawn from training splits of MATH, GSM8K, and other sources that have no overlap with AIME 2025-2026, LiveCodeBench, or GPQA-Diamond; we verified this by checking problem IDs and content hashes. We have revised §3 to explicitly document the corpus sources, the disjointness verification procedure, and a statement confirming zero overlap with the evaluation sets. This addition directly addresses the concern without altering any experimental results.
Revision: yes
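A content-hash disjointness check of the kind mentioned in this response could look like the following sketch. The normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions made for illustration; the rebuttal does not specify how the hashes were computed.

```python
import hashlib
import re
from typing import Iterable, List


def content_hash(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlapping_problems(
    corpus_problems: Iterable[str], test_problems: Iterable[str]
) -> List[str]:
    """Return every test problem whose normalized hash also appears in the corpus."""
    corpus_hashes = {content_hash(p) for p in corpus_problems}
    return [p for p in test_problems if content_hash(p) in corpus_hashes]
```

An empty result only rules out copies that are identical after normalization. Paraphrased or lightly edited versions of a test problem would pass this check, so a similarity-based comparison would be needed to catch them.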

Referee: [§4] (experimental results): The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.

Authors: We acknowledge that absolute numbers and statistical details improve interpretability. In the revised manuscript we have updated all tables in §4 to report absolute accuracies alongside the relative gains, included standard deviations computed over 5 independent runs for each condition, and added paired statistical significance tests (McNemar’s test for binary correctness and t-tests on accuracy) with p-values. These changes allow readers to evaluate robustness directly while preserving the original claims.
Revision: yes
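For readers unfamiliar with McNemar's test, the exact two-sided variant on paired per-problem correctness can be computed with the standard library alone. This is a generic sketch of the textbook test, not the authors' analysis code, and the correctness vectors in the usage lines are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], augmented: Sequence[bool]) -> float:
    """Exact two-sided McNemar p-value for paired binary outcomes.

    Only discordant pairs matter: b counts problems the baseline solved and the
    augmented system missed, c counts the reverse. Under the null hypothesis
    each discordant pair is equally likely to fall on either side.
    """
    if len(baseline) != len(augmented):
        raise ValueError("both systems must be scored on the same problems")
    b = sum(1 for x, y in zip(baseline, augmented) if x and not y)
    c = sum(1 for x, y in zip(baseline, augmented) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Hypothetical outcomes on 10 problems: 6 fixed by retrieval, 0 broken, 4 unchanged.
baseline_correct = [False] * 6 + [True] * 4
augmented_correct = [True] * 10
p_value = mcnemar_exact(baseline_correct, augmented_correct)  # 0.03125
```

The test ignores problems on which both systems agree, which is why it suits before/after comparisons on a fixed benchmark.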

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with independent benchmark comparisons.

Full rationale

The paper advances an empirical claim that RAG over thinking traces improves reasoning performance. It describes generating traces from problem-solving attempts, applying an offline T3 transformation, and running retrieve-then-generate experiments on public benchmarks (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). All reported gains are direct outcome measurements against non-RAG and web-corpus baselines; no equations, fitted parameters, or derivations appear that could reduce to self-definition or prior self-citations. The methodology is externally replicable via the released code and stated benchmarks, satisfying the criteria for a self-contained, non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the domain assumption that thinking traces encode transferable reasoning information. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Thinking traces generated by LLMs during problem solving contain useful, retrievable signals for improving reasoning on new problems.
This premise underpins the entire retrieval corpus choice and is required for the claimed gains.

pith-pipeline@v0.9.0 · 5596 in / 1338 out tokens · 61486 ms · 2026-05-07T14:29:33.071742+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Natural Language Query to Configuration for Retrieval Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0

BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.