TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
Pith reviewed 2026-05-22 13:35 UTC · model grok-4.3
The pith
Guiding LLM reinforcement learning rollouts with MCTS-derived templates raises high-quality reasoning trajectory rates and cuts ineffective exploration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TemplateRL builds an explicit problem-solving template library through MCTS on a small seed set and injects this library as guidance during RL policy optimization. By steering rollout generation to align with the discovered template structures, the method increases the hit rate of high-quality reasoning trajectories, reduces ineffective exploration, stabilizes training dynamics, and improves sampling efficiency. The library itself remains interpretable and editable, and the framework supports continuous online updates to the templates during both training and inference. Experiments report that TemplateRL outperforms GRPO by 99 percent on AIME and by 41 percent on AMC.
What carries the argument
The template library produced by MCTS on seed problems, which supplies explicit high-level structured guidance that shapes rollout generation and policy updates inside the RL loop.
If this is right
- Training dynamics become more stable, especially when starting from weaker base models.
- Sampling efficiency rises because rollouts are steered toward patterns already validated by the templates.
- The guidance remains editable, allowing human inspection or modification of the strategic patterns being reinforced.
- Online updates to the template library can be performed during both training and later inference without restarting the process.
- Performance gains appear on math benchmarks and extend across different problem domains.
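To make the "interpretable and editable" property concrete, here is a minimal sketch of what a template library could look like as a data structure. It is not the paper's implementation: the class names, the success counters, and the ranking rule are all hypothetical. Only the action labels `DC` (Divide and Conquer) and `CoT` (Chain-of-Thought) come from the paper's own template examples.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical template: an ordered list of high-level action labels."""
    actions: list[str]   # e.g. ["DC", "DC", "CoT"]
    successes: int = 0   # rollouts that followed the template and were correct
    attempts: int = 0    # rollouts that followed the template

    @property
    def success_rate(self) -> float:
        # Unused templates are treated as rate 0.0 (an arbitrary choice).
        return self.successes / self.attempts if self.attempts else 0.0


@dataclass
class TemplateLibrary:
    """Hypothetical library supporting inspection, editing, and online updates."""
    templates: dict[str, Template] = field(default_factory=dict)

    def add(self, name: str, actions: list[str]) -> None:
        self.templates[name] = Template(list(actions))

    def record(self, name: str, correct: bool) -> None:
        # Online update: fold one rollout outcome into the running counts.
        t = self.templates[name]
        t.attempts += 1
        t.successes += int(correct)

    def best(self, k: int = 1) -> list[str]:
        # Rank by empirical success rate; ties keep insertion order.
        ranked = sorted(self.templates.items(),
                        key=lambda kv: kv[1].success_rate, reverse=True)
        return [name for name, _ in ranked[:k]]


lib = TemplateLibrary()
lib.add("dc_dc_cot", ["DC", "DC", "CoT"])
lib.add("cot_only", ["CoT"])
lib.record("dc_dc_cot", True)
lib.record("cot_only", False)
# A human edit is just a mutation of the stored action list.
lib.templates["cot_only"].actions.insert(0, "DC")
```

Because each entry is a short list of named steps, a reviewer can read or change it directly, which is the property the bullets above rely on.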
Where Pith is reading between the lines
- The readable templates open a route for experts to inject domain knowledge by editing or adding entries rather than relying solely on reward signals.
- If the same template-extraction step is applied to other structured tasks such as code generation, the method could supply similar efficiency gains outside pure math reasoning.
- Because templates can be updated online, the approach may support long-running systems that accumulate and refine strategic patterns from ongoing interactions.
Load-bearing premise
Templates extracted from a small seed set encode general problem-solving strategies that transfer to new problems without narrowing the space of useful explorations or biasing the model toward the seed distribution.
What would settle it
Run TemplateRL and a plain GRPO baseline on a fresh set of held-out problems and measure whether the template-guided version still produces measurably higher rates of high-quality trajectories or higher final accuracy.
Original abstract
Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continuous updates during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TemplateRL, a structured template-guided reinforcement learning framework for improving LLM reasoning. It first builds a library of problem-solving templates via MCTS on a small seed set, then integrates this explicit guidance into policy optimization to steer rollouts toward high-quality trajectories, reduce ineffective exploration, and stabilize training. Experiments report large gains over GRPO (99% on AIME, 41% on AMC) plus benefits in stability on weak models and cross-domain generalization; the template library is also presented as interpretable and online-editable.
Significance. If the central claims hold after addressing the requested clarifications, TemplateRL would offer a practical way to inject structured, human-interpretable guidance into RL for LLM reasoning, potentially improving sample efficiency and training stability over purely unstructured self-sampling methods such as GRPO. The explicit, editable nature of the templates is a notable strength that could support broader applicability and debugging.
major comments (3)
- [§3.2] §3.2 (Template Library Construction): The MCTS procedure on the seed set is described at a high level, but the reward formulation, template extraction criteria, and exact representation of templates (e.g., as sequences of reasoning steps or constraints) are not specified. This is load-bearing for the claim that the templates encode transferable high-level strategies rather than seed-specific patterns.
- [§5.1, Table 2] §5.1 and Table 2 (Performance Results): The reported 99% improvement on AIME and 41% on AMC are presented without rollout counts per problem, number of independent training runs, standard deviations, or statistical significance tests. In the absence of these controls it is impossible to rule out that the gains arise from variance, implicit narrowing of the rollout distribution, or overfitting to the seed-set distribution rather than genuine strategy transfer.
- [§4.1] §4.1 (Template-Guided Policy Optimization): The mechanism by which templates are injected into the RL objective (e.g., as an auxiliary loss, constrained sampling, or modified reward) is not formalized with equations or pseudocode. Without this formalization it is difficult to assess whether the method avoids the exploration-collapse risk highlighted in the stress-test note when templates are applied to held-out problems whose distribution differs from the seed set.
minor comments (2)
- [Abstract, §5.3] The abstract and §5.3 claim 'remarkable cross-domain generalization,' yet the experimental section does not explicitly list the held-out domains or quantify distribution shift between seed and test sets.
- [§3.3] Notation for the template library and its update rule is introduced without a clear table or diagram; a small illustrative example of a template before and after an online edit would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional empirical controls.
-
Referee: [§3.2] §3.2 (Template Library Construction): The MCTS procedure on the seed set is described at a high level, but the reward formulation, template extraction criteria, and exact representation of templates (e.g., as sequences of reasoning steps or constraints) are not specified. This is load-bearing for the claim that the templates encode transferable high-level strategies rather than seed-specific patterns.
Authors: We agree that the current description in §3.2 is high-level and would benefit from greater precision. In the revised manuscript we will expand this section to specify the MCTS reward (a combination of final-answer correctness and per-step verification), the extraction criteria (paths with success rate above a threshold and low variance on semantically similar problems), and the template representation (ordered sequences of high-level reasoning steps together with explicit constraint predicates). These additions will make explicit how the templates capture reusable strategic patterns rather than seed-specific artifacts. revision: yes
-
Referee: [§5.1, Table 2] §5.1 and Table 2 (Performance Results): The reported 99% improvement on AIME and 41% on AMC are presented without rollout counts per problem, number of independent training runs, standard deviations, or statistical significance tests. In the absence of these controls it is impossible to rule out that the gains arise from variance, implicit narrowing of the rollout distribution, or overfitting to the seed-set distribution rather than genuine strategy transfer.
Authors: We accept that the current presentation lacks the statistical safeguards needed to support the reported gains. We will revise §5.1 and Table 2 to report rollout counts per problem (16 for AIME, 8 for AMC), results aggregated over five independent training runs with distinct random seeds, standard deviations, and p-values from paired t-tests against GRPO. These additions will allow readers to assess whether the improvements exceed what could be explained by variance or seed-set overfitting. revision: yes
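As an illustration of the kind of control the authors commit to, the sketch below computes a paired t-statistic over five runs using only the standard library. The accuracy numbers are placeholders invented for this example and are not results from the paper. The critical value 2.776 is the standard two-sided 5% threshold for 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t-statistic for paired samples a[i], b[i] (e.g. same seed, two methods)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    return mean_d / (sd_d / math.sqrt(len(diffs)))


# PLACEHOLDER accuracies for five seeds; not taken from the paper.
method_a = [31.0, 29.5, 33.0, 30.5, 32.0]
method_b = [27.0, 28.0, 29.0, 27.5, 28.5]

t_stat = paired_t_statistic(method_a, method_b)
T_CRIT_DF4 = 2.776  # two-sided alpha = 0.05, df = n - 1 = 4
significant = abs(t_stat) > T_CRIT_DF4
```

With only five runs the test has little power, so reporting the per-run values alongside the p-value would let readers judge the spread directly.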
-
Referee: [§4.1] §4.1 (Template-Guided Policy Optimization): The mechanism by which templates are injected into the RL objective (e.g., as an auxiliary loss, constrained sampling, or modified reward) is not formalized with equations or pseudocode. Without this formalization it is difficult to assess whether the method avoids the exploration-collapse risk highlighted in the stress-test note when templates are applied to held-out problems whose distribution differs from the seed set.
Authors: We concur that an explicit formalization is required. The revised §4.1 will include the precise objective (the original GRPO loss augmented by a KL term that penalizes deviation from the template-constrained sampling distribution) together with pseudocode in the appendix. On the exploration-collapse concern, we note that our cross-domain experiments already apply seed-derived templates to out-of-distribution problems and observe no collapse; the online editability of the template library further permits adaptation. We will add a short discussion of this mitigation in the revised text. revision: yes
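One possible way to write the objective described in this response is shown below. It is a sketch under assumptions, not the paper's equation: the weight $\beta$, the symbol $\pi_{\text{tmpl}}$ for the template-constrained sampling distribution, and the direction of the KL term are all choices made here for illustration.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{GRPO}}(\theta)
  + \beta \, \mathbb{E}_{q}\!\left[
      D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid q) \,\big\|\, \pi_{\text{tmpl}}(\cdot \mid q) \big)
    \right]
```

Under this reading, $\beta = 0$ recovers plain GRPO, and a large $\beta$ pins the policy to the template distribution, which is exactly the regime where the referee's exploration-collapse concern would apply.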
Circularity Check
No significant circularity; empirical method builds on external components
The paper describes a method that first extracts templates via MCTS on a small seed set and then incorporates the resulting library as guidance within standard RL policy optimization (e.g., compared to GRPO). Reported gains on AIME and AMC are presented as experimental outcomes from benchmark evaluation rather than quantities derived by construction from the same fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown reducing the central claims to the inputs; the template construction and RL integration remain independent steps whose validity can be checked against held-out data and external baselines. This keeps the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Templates extracted via MCTS on a small seed set capture generalizable problem-solving strategies.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation: washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training
-
IndisputableMonolith/Foundation/ArithmeticFromLogic: LogicNat induction and recovery (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Proposition 3.1 (Stability and positive-sample guarantee) ... grouping increases Ppos
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
-
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
-
EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning
EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Reference graph
Works this paper leans on
-
[1]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
rStar-Math: Small LLMs can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and others. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint ...
-
[2]
Understanding R1-Zero-Like Training: A Critical Perspective
Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.0225...
-
[3]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. 2024a. Measuring multimodal mathematical reasoning with MATH-vision dataset. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Bench...
-
[4]
Structure-Guided RL Paradigm: Augmenting policy optimization with explicit structured templates significantly improves training efficiency, stability, and generalization over unstructured self-sampling
-
[5]
Transferable High-Level Strategic Patterns: Abstract reasoning templates exhibit remarkable cross-domain and cross-task transferability, enabling knowledge accumulation and flexible expert intervention through interpretable, editable structures
-
[6]
Effective Learning on Weaker Models: Template-guided training compensates for limited model capacity, achieving stable training where standard RL frequently collapses. Existing RL methods for LLM reasoning rely primarily on unstructured self-sampling to fit scalar rewards, producing inefficient rollouts that fail to capture transferable prob...
-
[7]
The probability that the entire batch of $G$ trajectories contains at least one positive sample equals $P^{\text{batch}}_{\ge 1} = 1 - \prod_{h=1}^{m} (1 - p_h)$. In particular, if groups are symmetric ($p_h \equiv p_{\text{grp}}$), then $P^{\text{batch}}_{\ge 1} = 1 - (1 - p_{\text{grp}})^m$, which is strictly increasing in $m$ and approaches 1 as $m \to \infty$ for any fixed $p_{\text{grp}} > 0$
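The complement rule in this item is easy to check numerically. The sketch below is an illustration only; the function name and the example probabilities are made up here, and the groups are assumed independent as in the statement.

```python
import math


def prob_at_least_one_positive(group_probs):
    """1 - prod_h (1 - p_h): chance that at least one group yields a positive sample.

    Assumes the per-group events are independent.
    """
    for p in group_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return 1.0 - math.prod(1.0 - p for p in group_probs)


# Symmetric case with an illustrative p_grp = 0.1: the value grows with m.
symmetric = [prob_at_least_one_positive([0.1] * m) for m in (1, 2, 4, 8)]
```

The same function covers the template-transfer comparison later in the list: passing per-group probabilities that are at least as large in every position can only raise the result.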
-
[8]
Let $g_h$ denote the (vector-valued) gradient contribution from group $h$ (e.g., the average of per-trajectory contributions within the group). Assume group contributions are independent with $\mathbb{E}[g_h] = \bar{g}_{\text{grp}}$ and $\mathrm{Var}[g_h] = \Sigma_{\text{grp}}$ (bounded). Then the variance of the batch-averaged estimator $\hat{g} = \frac{1}{m} \sum_{h=1}^{m} g_h$ satisfies $\mathrm{Var}[\hat{g}] \preceq \frac{1}{m} \Sigma_{\text{grp}} + o\!\left(\frac{1}{m}\right)$, i.e. the leading-orde...
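The 1/m scaling in this item can be illustrated with a small simulation. This is a toy check, not the paper's setting: each group contribution is replaced by an independent scalar draw from Uniform(0, 1), whose variance is 1/12, and the function name, trial count, and seed are arbitrary choices.

```python
import random
import statistics


def variance_of_group_mean(m, trials=20000, seed=0):
    """Empirical variance of the mean of m independent Uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.random() for _ in range(m))
             for _ in range(trials)]
    return statistics.pvariance(means)


# Theory for this toy case: Var[mean] = (1/12) / m.
estimates = {m: variance_of_group_mean(m) for m in (1, 4, 16)}
```

Each fourfold increase in m should cut the estimate by roughly a factor of four, matching the leading-order term in the bound.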
-
[9]
Per-mini-group: the event that mini-group $h$ contains at least one positive trajectory due to template transfer occurs with probability at least $r_h$. If $r_h > p^{\text{policy}}_h$ (the per-group success probability under policy rollouts), template transfer strictly improves per-group success
-
[10]
Batch-level: the probability that the full batch has at least one positive sample is $P^{\text{template}}_{\ge 1} = 1 - \prod_{h=1}^{m} (1 - r_h)$, which exceeds the policy-rollout probability $1 - \prod_{h=1}^{m} (1 - p^{\text{policy}}_h)$ whenever $r_h \ge p^{\text{policy}}_h$ for all $h$, with strict inequality for some $h$. Proof sketch. (1) and (2) follow from the same complement argument as in the proof of Proposition 3.1 (B.2). B.4...
-
[11]
Thus, for new queries $q$ that are close to $q'$ (small $d_{\text{PCC}}(q, q')$), the inequality in Lemma B.1 is likely to hold. This implies that retrieving high-quality templates from similar seed problems leads to transferred trajectories with positive advantage on the new problem. Consequently, the per-group template success probability $r_h$ is strictly higher than th...
-
[12]
DC (Divide and Conquer): Breaking down a complex reasoning problem into several smaller subproblems and progressively solving them to achieve the overall solution
-
[13]
CoT (Chain-of-Thought): Facilitating step-by-step reasoning by constructing a logical sequence of intermediate thoughts, where each step incrementally builds on the previous ones.
Figure 10: Thought Template Visualization. On the left, the seed dataset contains 3 questions with the same reasoning pattern, “DC→DC→CoT”, correspond...
-
[14]
Calculate differences: For each benchmark pair, compute the difference $D_i = X_i - Y_i$, where $X_i$ represents TemplateRL performance and $Y_i$ represents baseline performance
-
[15]
Rank differences: Take absolute values $|D_i|$ and rank them from smallest to largest as $R_i$, with average ranks assigned for ties
-
[16]
Assign signs to ranks: For each difference $D_i$, assign its sign to the corresponding rank: $R'_i = \operatorname{sign}(D_i) \cdot R_i$
-
[17]
Calculate rank sums: Compute positive and negative rank sums: $W^+ = \sum_{D_i > 0} R'_i$ and $W^- = \sum_{D_i < 0} R'_i$
-
[18]
Determine test statistic: The test statistic is $W = \min(W^+, W^-)$
-
[19]
Calculate p-value: Derive the p-value from the distribution of test statistic $W$. We test the null hypothesis $H_0$: no significant difference between methods against the alternative hypothesis $H_1$: significant difference exists. Comparing TemplateRL with GRPO on Qwen2.5-Math-7B-Base across all benchmarks, we obtain a p-value of 0.0156. Using a significanc...
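The six steps above can be sketched as a small exact Wilcoxon signed-rank test. This is an illustrative implementation, not the authors' code. Two choices here are assumptions: zero differences are dropped, and $W^-$ is taken as the sum of rank magnitudes for negative differences (the usual convention, so that $W$ is non-negative). The p-value is two-sided and computed by enumerating all sign assignments, which is only practical for small samples.

```python
from itertools import product


def signed_ranks(diffs):
    """Signed ranks R'_i, with average ranks for tied |D_i|. Zeros are dropped."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of tied absolute values.
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r if d > 0 else -r for r, d in zip(ranks, nonzero)]


def wilcoxon_exact(diffs):
    """Return (W, two-sided exact p-value) with W = min(W+, W-)."""
    sr = signed_ranks(diffs)
    w_plus = sum(r for r in sr if r > 0)
    w_minus = sum(-r for r in sr if r < 0)  # magnitudes, so W- >= 0
    w = min(w_plus, w_minus)
    mags = [abs(r) for r in sr]
    total = sum(mags)
    hits = 0
    # Under H0 every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(mags)):
        wp = sum(m for m, s in zip(mags, signs) if s)
        if min(wp, total - wp) <= w:
            hits += 1
    return w, hits / 2 ** len(mags)


# Seven illustrative differences, all positive (placeholder values).
w_stat, p_value = wilcoxon_exact([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

With seven pairs that all favor one method, the exact two-sided p-value is 2/2^7 = 0.015625. That is consistent with the reported 0.0156, though the number of benchmarks is not stated in this excerpt, so the match is an inference rather than a confirmed reading.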
-
[20]
Given that the highest score is 98:
-
[21]
After removing the highest and lowest scores, there are 10 scores left: So, we have:
-
[22]
Solve for the sum of the highest and lowest scores:
-
[23]
We know the highest score is 98:
-
[24]
Substitute into the equation for the sum of the highest and lowest scores: $98 + x_{12} = 144$. Thus, the lowest score is $x_{12} = 46$. The problem involves … final answer is 46. The problem involves … final answer is 46. …(repeat)
Equations from the worked example, in order: $\sum_{i=1}^{12} x_i = 12 \times 82 = 984$; $\sum_{i=1}^{12} x_i - x_1 - x_{12} = 984 - x_1 - x_{12}$; $\sum_{i=2}^{11} x_i = 10 \times 84 = 840$; $984 - x_1 - x_{12} = 840$; $x_1 + x_{12} = 984 - 840 = 144$; $x_1 = 98$; $98 + x_{12} = 144$; $x_{12} = 46$. Let's...
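The arithmetic in this worked example can be verified in a few lines. The variable names are chosen here, and reading 82 and 84 as the mean of all 12 scores and the mean of the middle 10 is an interpretation of the equations in the excerpt, since the original problem statement is not shown.

```python
# Values taken from the equations in the excerpt.
n_all, mean_all = 12, 82   # all scores
n_mid, mean_mid = 10, 84   # after dropping the highest and lowest
highest = 98

total_all = n_all * mean_all          # 12 * 82
total_mid = n_mid * mean_mid          # 10 * 84
sum_extremes = total_all - total_mid  # highest + lowest
lowest = sum_extremes - highest

print(total_all, total_mid, sum_extremes, lowest)  # prints: 984 840 144 46
```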