RAG over Thinking Traces Can Improve Reasoning Tasks
Pith reviewed 2026-05-07 14:29 UTC · model grok-4.3
The pith
Retrieving thinking traces from problem-solving attempts improves reasoning performance on math and code benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Retrieving from a corpus of thinking traces generated during problem-solving attempts enables a simple retrieve-then-generate pipeline to improve reasoning performance on benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. The approach outperforms both non-RAG baselines and retrieval over web documents, with further gains from T3, an offline transformation that produces structured representations of the traces. Relative improvements reach 56.3 percent on AIME 2025-2026 when Gemini-2.5-Flash retrieves traces generated by Gemini-2-thinking, and the method works even when the trace-generating model differs from the one answering new queries.
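Because the headline figures are relative gains, it helps to see how a relative gain relates to the underlying accuracies. The sketch below is only an illustration: the function name is made up here, and the accuracies are hypothetical values chosen for the arithmetic, not numbers from the paper.

```python
def relative_gain(baseline_acc: float, rag_acc: float) -> float:
    """Relative improvement of rag_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (rag_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies (NOT taken from the paper). Both pairs yield the
# same relative gain even though the absolute changes differ by a factor of 3.
weak_model = relative_gain(0.20, 0.30)    # +10 points absolute
strong_model = relative_gain(0.60, 0.90)  # +30 points absolute
```

The same relative gain can therefore describe very different absolute changes, which is why absolute accuracies matter when comparing models of different strength.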
What carries the argument
The argument rests on thinking traces, the intermediate reasoning trajectories produced while attempting to solve problems, used directly as the retrieval corpus in a RAG pipeline. It also rests on T3, an offline method that converts raw traces into structured, compact, or diagnostic representations that are easier to match and to use.
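A retrieve-then-generate pipeline over traces can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the bag-of-words cosine retriever, the toy corpus, the prompt wording, and all function names are invented here, and the call to the answering model is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k traces most similar to the query (stand-in retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus, key=lambda trace: cosine(q, Counter(tokenize(trace))), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, traces: list[str]) -> str:
    """Prepend retrieved traces to the problem; the wording is an assumption."""
    context = "\n\n".join(f"[Trace {i + 1}]\n{t}" for i, t in enumerate(traces))
    return f"Reference thinking traces:\n{context}\n\nProblem:\n{query}\n\nSolve step by step."


# Toy corpus of hypothetical traces, written for this example only.
CORPUS = [
    "To find the remainder of a large power, look for a cycle in the residues modulo the divisor.",
    "For edit distance, define dp[i][j] over prefixes of both strings and fill the table row by row.",
    "To count lattice paths in a grid, choose which steps go right: a binomial coefficient.",
]
QUERY = "Find the remainder when 7^100 is divided by 5."

top = retrieve(QUERY, CORPUS, k=2)
prompt = build_prompt(QUERY, top)  # this string would be sent to the answering model
```

In a real system the lexical retriever would likely be replaced by a dense or hybrid one, and the corpus entries could be either raw traces or their T3-transformed versions.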
If this is right
- RAG over thinking traces outperforms retrieval from standard web corpora on reasoning benchmarks.
- Gains persist even when traces come from an earlier model and are applied to more recent models.
- The T3-structured version of the corpus can reduce inference cost by up to 15 percent while raising accuracy.
- Consistent improvements appear across model scales and across math, code, and science benchmarks.
Where Pith is reading between the lines
- If traces carry reusable reasoning patterns, then curating large shared libraries of high-quality thinking traces could become a practical way to augment future models without retraining.
- The results suggest that external retrieval of intermediate steps may sometimes be cheaper and more reliable than forcing a model to regenerate every reasoning path from scratch.
- This technique could be tested on domains outside math and code, such as scientific hypothesis generation, where intermediate reasoning steps are also recorded.
Load-bearing premise
Thinking traces generated during problem-solving attempts contain generalizable, high-quality reasoning signals that transfer usefully to new problems and different models without introducing systematic errors or biases from the trace-generation process itself.
What would settle it
Running the same retrieve-then-generate experiments on AIME or LiveCodeBench but replacing the thinking-traces corpus with either randomly ordered traces or traces drawn from unrelated problem domains, and observing whether performance gains disappear or reverse.
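The two control corpora described above could be constructed roughly as follows. This sketch assumes that each corpus entry pairs a problem text (used as the retrieval key) with its trace and a domain label; that layout, the `TraceEntry` type, and both function names are hypothetical. Reading "randomly ordered" as "traces re-paired with the wrong problems" is also an interpretation, not something the source specifies.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    domain: str   # e.g. "math" or "code" (hypothetical labels)
    problem: str  # text assumed to serve as the retrieval key
    trace: str    # thinking trace returned when this entry is retrieved


def shuffled_control(corpus: list[TraceEntry], seed: int = 0) -> list[TraceEntry]:
    """Keep retrieval keys in place but re-pair them with shuffled traces."""
    rng = random.Random(seed)
    traces = [entry.trace for entry in corpus]
    rng.shuffle(traces)
    return [
        TraceEntry(entry.domain, entry.problem, trace)
        for entry, trace in zip(corpus, traces)
    ]


def unrelated_domain_control(
    corpus: list[TraceEntry], target_domain: str
) -> list[TraceEntry]:
    """Keep only traces whose domain differs from the benchmark's domain."""
    return [entry for entry in corpus if entry.domain != target_domain]
```

If the gains survive either control, the retrieved content is probably acting as generic prompt padding rather than as transferable reasoning.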
Original abstract
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that retrieval-augmented generation (RAG) over thinking traces—intermediate reasoning trajectories generated during problem-solving attempts—can substantially improve performance on reasoning-intensive tasks such as mathematics and code generation, contrary to the common view that RAG is ineffective for such problems. The authors introduce T3, an offline method to convert raw traces into structured, retrieval-friendly representations, and report that a simple retrieve-then-generate pipeline using these traces as the corpus yields consistent gains over non-RAG baselines and standard web-document RAG on benchmarks including AIME 2025-2026, LiveCodeBench, and GPQA-Diamond. Specific results include relative improvements of +56.3% on AIME for Gemini-2.5-Flash, +8.6% for GPT-OSS-120B, and +7.6% for GPT-5 when using traces from Gemini-2-thinking, with little or no added inference cost and sometimes up to 15% cost reduction. Code is released at the provided GitHub repository.
Significance. If the results hold without data leakage or other confounds, the work is significant because it provides concrete empirical evidence that the choice of corpus, rather than RAG itself, limits its utility on reasoning tasks. Demonstrating that thinking traces contain transferable, high-quality reasoning signals that can be retrieved to augment even strong models without extra cost could influence how reasoning systems are built, shifting focus toward curating and structuring intermediate reasoning data. The release of code supports reproducibility and further exploration in the IR and LLM reasoning communities.
major comments (2)
- [Abstract and §3] Abstract and §3 (trace corpus construction): The headline empirical claim—that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation—requires that the trace corpus was built exclusively on problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).
- [§4] §4 (experimental results): The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.
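The disjointness check raised in the first major comment can be approximated with content hashes over normalized problem text. The following is a sketch under assumptions: the normalization rule (collapse whitespace, lowercase) and the function names are chosen here for illustration and are not taken from the manuscript or its code release.

```python
import hashlib
import re


def content_hash(problem_text: str) -> str:
    """SHA-256 of the problem text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", problem_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_overlap(corpus_problems: list[str], test_problems: list[str]) -> list[str]:
    """Return corpus problems whose normalized text also appears in the test set."""
    test_hashes = {content_hash(p) for p in test_problems}
    return [p for p in corpus_problems if content_hash(p) in test_hashes]
```

An exact hash only catches copies that are identical after normalization. Paraphrased or lightly edited problems would pass this check, so a near-duplicate test (for example n-gram overlap or embedding similarity) would be needed to rule out softer forms of leakage.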
minor comments (2)
- [Abstract] The T3 transformation is described at a high level in the abstract; a concise pseudocode or diagram in the main text would improve clarity on how raw traces are turned into structured representations.
- The manuscript could add a short related-work paragraph contrasting this approach with prior uses of chain-of-thought or self-generated data for retrieval or distillation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve its clarity and completeness.
Referee: [Abstract and §3] Abstract and §3 (trace corpus construction): The headline empirical claim—that retrieve-then-generate over thinking traces produces generalizable reasoning augmentation—requires that the trace corpus was built exclusively on problems disjoint from the test sets (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). The manuscript describes traces only as coming from “problem solving attempts” without stating or verifying a hold-out split. If any test problem appears in the corpus, retrieval can surface its own solution trajectory, converting the pipeline into answer lookup rather than reasoning transfer. This directly undermines the interpretation of the reported gains (e.g., +56.3% relative on AIME).
Authors: We agree this is a crucial clarification for interpreting the results as reasoning transfer rather than leakage. The thinking traces were generated exclusively from problem-solving attempts on problems drawn from training splits of MATH, GSM8K, and other sources that have no overlap with AIME 2025-2026, LiveCodeBench, or GPQA-Diamond; we verified this by checking problem IDs and content hashes. We have revised §3 to explicitly document the corpus sources, the disjointness verification procedure, and a statement confirming zero overlap with the evaluation sets. This addition directly addresses the concern without altering any experimental results.
Revision: yes
Referee: [§4] §4 (experimental results): The reported relative improvements are given without accompanying absolute accuracies, standard deviations across runs, or statistical significance tests. This makes it difficult to assess whether the gains are robust or driven by benchmark-specific variance, especially when comparing across models of different strengths.
Authors: We acknowledge that absolute numbers and statistical details improve interpretability. In the revised manuscript we have updated all tables in §4 to report absolute accuracies alongside the relative gains, included standard deviations computed over 5 independent runs for each condition, and added paired statistical significance tests (McNemar’s test for binary correctness and t-tests on accuracy) with p-values. These changes allow readers to evaluate robustness directly while preserving the original claims.
Revision: yes
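For readers unfamiliar with McNemar's test, a paired comparison of per-problem correctness can be sketched as below. This uses the exact (binomial) form of the test with only the standard library; the function name and the input layout (two parallel lists of booleans) are assumptions for illustration, and the rebuttal does not say which variant of the test the authors ran.

```python
from math import comb


def mcnemar_exact(baseline_correct: list[bool], rag_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    if len(baseline_correct) != len(rag_correct):
        raise ValueError("both systems must be scored on the same problems")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(baseline_correct, rag_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, rag_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)
```

With small benchmarks such as a single AIME year, the number of discordant pairs is small, so even a large relative gain may not reach conventional significance thresholds.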
Circularity Check
No circularity; purely empirical evaluation with independent benchmark comparisons.
The paper advances an empirical claim that RAG over thinking traces improves reasoning performance. It describes generating traces from problem-solving attempts, applying an offline T3 transformation, and running retrieve-then-generate experiments on public benchmarks (AIME 2025-2026, LiveCodeBench, GPQA-Diamond). All reported gains are direct outcome measurements against non-RAG and web-corpus baselines; no equations, fitted parameters, or derivations appear that could reduce to self-definition or prior self-citations. The methodology is externally replicable via the released code and stated benchmarks, satisfying the criteria for a self-contained, non-circular result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Thinking traces generated by LLMs during problem solving contain useful, retrievable signals for improving reasoning on new problems.
Forward citations
Cited by 1 Pith paper
- Natural Language Query to Configuration for Retrieval Agents: BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.