
arxiv: 2505.15692 · v5 · pith:7TVFGYSUnew · submitted 2025-05-21 · 💻 cs.CL · cs.LG

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

Pith reviewed 2026-05-22 13:35 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords TemplateRL · structured template guidance · MCTS templates · LLM reasoning · reinforcement learning · trajectory optimization · GRPO comparison · math benchmarks

The pith

Guiding LLM reinforcement learning rollouts with MCTS-derived templates raises high-quality reasoning trajectory rates and cuts ineffective exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that current RL methods for LLM reasoning waste effort on unstructured sampling and miss transferable strategies. TemplateRL first extracts a library of problem-solving templates by running MCTS on a small seed collection, then feeds those templates back into the RL loop to shape how new rollouts are generated. This explicit structure steers the policy toward patterns that have already worked, raising the fraction of useful trajectories while lowering the amount of wasted computation. The result is claimed to be faster convergence, more stable training, and stronger final performance on math benchmarks, with the added benefit that the templates stay human-readable and can be edited or updated on the fly.

Core claim

TemplateRL builds an explicit problem-solving template library through MCTS on a small seed set and injects this library as guidance during RL policy optimization. By forcing rollout generation to align with the discovered template structures, the method increases the hit rate of high-quality reasoning trajectories, reduces ineffective exploration, stabilizes training dynamics, and improves sampling efficiency. The library itself remains interpretable and editable, and the framework supports continuous online updates to the templates during both training and inference. Experiments report that these changes produce 99 percent higher performance than GRPO on AIME and 41 percent higher on AMC.
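
To make the headline percentages concrete, the relation below spells out what a "99 percent higher" score would mean if the figure is a relative improvement. That reading is an assumption: this page does not say whether the gains are relative or absolute. The symbols $a_{\mathrm{TemplateRL}}$ and $a_{\mathrm{GRPO}}$ are introduced here for illustration and denote benchmark accuracy under each method.

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{a_{\mathrm{TemplateRL}} - a_{\mathrm{GRPO}}}{a_{\mathrm{GRPO}}},
\qquad
\Delta_{\mathrm{rel}} = 0.99
  \;\Longleftrightarrow\;
  a_{\mathrm{TemplateRL}} = 1.99\, a_{\mathrm{GRPO}} .
\]
```

Under this reading, the size of the gain in absolute points depends entirely on the baseline accuracy, which is why the baseline scores matter when interpreting the claim.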

What carries the argument

The template library produced by MCTS on seed problems, which supplies explicit high-level structured guidance that shapes rollout generation and policy updates inside the RL loop.
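
As a rough illustration of what "explicit high-level structured guidance" could look like in code, the sketch below stores each template as an ordered chain of abstract actions and prepends the closest match to a rollout prompt. The action labels DC (divide and conquer) and CoT (chain-of-thought) appear in the paper's extracted text; everything else is an assumption made for illustration, including the `Template` class, the scalar `complexity` feature used for retrieval, and the prompt wording. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical template: an ordered chain of abstract reasoning actions."""
    actions: tuple[str, ...]   # e.g. ("DC", "DC", "CoT")
    complexity: float          # assumed scalar feature of the seed problem


def retrieve(library: list[Template], complexity: float, k: int = 1) -> list[Template]:
    """Return the k templates whose seed complexity is closest to the query's."""
    return sorted(library, key=lambda t: abs(t.complexity - complexity))[:k]


def guided_prompt(question: str, template: Template) -> str:
    """Prepend the template's action chain as explicit guidance for a rollout."""
    plan = " -> ".join(template.actions)
    return f"Follow this reasoning plan: {plan}\nQuestion: {question}"


# Toy library with two templates (illustrative values only).
library = [
    Template(("CoT",), complexity=1.0),
    Template(("DC", "DC", "CoT"), complexity=4.0),
]
best = retrieve(library, complexity=3.5)[0]
prompt = guided_prompt("What is the lowest score?", best)
```

Keeping the library as plain data is what would make it inspectable and editable: adding, removing, or rewriting an entry changes the guidance without touching model weights.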

If this is right

  • Training dynamics become more stable, especially when starting from weaker base models.
  • Sampling efficiency rises because rollouts are steered toward patterns already validated by the templates.
  • The guidance remains editable, allowing human inspection or modification of the strategic patterns being reinforced.
  • Online updates to the template library can be performed during both training and later inference without restarting the process.
  • Performance gains appear on math benchmarks and extend across different problem domains.
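
One way to see why a higher hit rate translates into sampling efficiency: GRPO normalizes rewards within each group of rollouts, so a group in which every rollout fails yields zero advantage and no learning signal. The sketch below computes the chance that a group contains at least one success, assuming independent rollouts with a binary reward. The function name and the example success rates are illustrative and are not taken from the paper.

```python
def p_at_least_one_success(p: float, group_size: int) -> float:
    """Probability that at least one of `group_size` independent rollouts succeeds."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (1.0 - p) ** group_size


# Illustrative per-rollout success rates (not figures from the paper).
for p in (0.01, 0.05, 0.20):
    print(f"p={p:.2f}  G=8 -> {p_at_least_one_success(p, 8):.3f}")
```

On hard problems, where the per-rollout success rate is small, even a modest increase in that rate noticeably raises the share of groups that carry any gradient signal at all.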

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The readable templates open a route for experts to inject domain knowledge by editing or adding entries rather than relying solely on reward signals.
  • If the same template-extraction step is applied to other structured tasks such as code generation, the method could supply similar efficiency gains outside pure math reasoning.
  • Because templates can be updated online, the approach may support long-running systems that accumulate and refine strategic patterns from ongoing interactions.

Load-bearing premise

Templates extracted from a small seed set encode general problem-solving strategies that transfer to new problems without narrowing the space of useful explorations or biasing the model toward the seed distribution.

What would settle it

Run TemplateRL and a plain GRPO baseline on a fresh set of held-out problems and measure whether the template-guided version still produces measurably higher rates of high-quality trajectories or higher final accuracy.
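
The comparison described above could be scored along the following lines. This is a minimal sketch under stated assumptions: each method is represented by a list of per-problem rollout rewards, a rollout counts as high-quality when its reward reaches `threshold`, and the function names and toy reward values are invented for illustration.

```python
from statistics import mean


def hit_rate(rewards: list[list[float]], threshold: float = 1.0) -> float:
    """Fraction of rollouts, pooled over problems, with reward >= threshold."""
    flat = [r for per_problem in rewards for r in per_problem]
    return mean(1.0 if r >= threshold else 0.0 for r in flat)


def solved_fraction(rewards: list[list[float]], threshold: float = 1.0) -> float:
    """Fraction of problems where at least one rollout reaches the threshold."""
    return mean(1.0 if max(p) >= threshold else 0.0 for p in rewards)


# Toy rewards for two held-out problems, four rollouts each (made-up values).
grpo = [[0, 0, 1, 0], [0, 0, 0, 0]]
guided = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(hit_rate(grpo), hit_rate(guided))
print(solved_fraction(grpo), solved_fraction(guided))
```

Reporting both quantities separates "more rollouts succeed" from "more problems get solved", which can diverge if guidance concentrates successes on problems the baseline already handles.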

Figures

Figures reproduced from arXiv: 2505.15692 by Chonghua Liao, Haoran Luo, Huazhe Xu, Jianhua Tao, Jinyang Wu, Ling Yang, Mingkuan Feng, Shuai Zhang, Zhengqi Wen.

Figure 1
Figure 1: Paradigm comparison: Teacher-Student Analogy for RL. Standard RL like GRPO (left) provides only sparse answer rewards, while TemplateRL (right) offers structured templates encoding problem-solving thought patterns, enabling effective learning of both concrete steps and underlying strategic logic.
Figure 2
Figure 2: Flowchart of TemplateRL. This framework consists of three components: (1) template construction (Section 3.1); (2) template-guided training (Section 3.2); and (3) optional template updates (Section 3.3).
Figure 3
Figure 3: Structure of action-chain solution trajectories.
Figure 4
Figure 4: Performance comparison across different model scales and architectures. For better visualization, MATH500 results are adjusted by subtracting 20 points for all models.
Figure 5
Figure 5: Training stability verification. We evaluate on Qwen2.5-Math-7B-Base and Llama-3.2-3B-Base. Panels: (a) Qwen2.5-Math-7B-Base, (b) Llama-3.2-3B-Base; each plots training rewards against steps for GRPO and Ours. Surrounding text: We train for 500 steps on 8 A100 GPUs. More details are provided in Appendix D.
Figure 6
Figure 6: Cross-domain generalization verification. We provide OOD results on Qwen2.5-Math-7B-Base. Surrounding text: As shown in Figure 6, TemplateRL consistently outperforms GRPO across OOD tasks, with +6.1% performance gains on complex agentic scenarios (BALROG). This reveals that high-level template guidance effectively enhances model generalization to practical applications.
Figure 7
Figure 7: Case Study. TemplateRL produces a more structured and interpretable reasoning chain with clear steps.
Figure 8
Figure 8: Ablation study. We provide results with different numbers of thought patterns (template guidance). Panels: (a) Training Reward Curve, (b) Evaluation Performance on MATH500, AIME24, Minerva Math, AMC, and Olympiad, for |g| = 1, 2, and 4.
Figure 9
Figure 9: An illustration of four phases in an iteration of MCTS for complex reasoning tasks.
Figure 10
Figure 10: Thought Template Visualization. On the left, the seed dataset contains 3 questions with the same …
Figure 11
Figure 11: The system prompt used for all experiments.
Figure 12
Figure 12: Comparison of GRPO and TemplateRL for a simple algorithm problem from the MATH dataset.
Figure 13
Figure 13: Comparison of GRPO and TemplateRL for a difficult algorithm problem from the MATH dataset.
Original abstract

Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose TemplateRL, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic patterns, stabilizing training dynamics, and enhancing RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continuous updates during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and remarkable cross-domain generalization, highlighting its potential for broader tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TemplateRL, a structured template-guided reinforcement learning framework for improving LLM reasoning. It first builds a library of problem-solving templates via MCTS on a small seed set, then integrates this explicit guidance into policy optimization to steer rollouts toward high-quality trajectories, reduce ineffective exploration, and stabilize training. Experiments report large gains over GRPO (99% on AIME, 41% on AMC) plus benefits in stability on weak models and cross-domain generalization; the template library is also presented as interpretable and online-editable.

Significance. If the central claims hold after addressing the requested clarifications, TemplateRL would offer a practical way to inject structured, human-interpretable guidance into RL for LLM reasoning, potentially improving sample efficiency and training stability over purely unstructured self-sampling methods such as GRPO. The explicit, editable nature of the templates is a notable strength that could support broader applicability and debugging.

major comments (3)
  1. [§3.2] §3.2 (Template Library Construction): The MCTS procedure on the seed set is described at a high level, but the reward formulation, template extraction criteria, and exact representation of templates (e.g., as sequences of reasoning steps or constraints) are not specified. This is load-bearing for the claim that the templates encode transferable high-level strategies rather than seed-specific patterns.
  2. [§5.1, Table 2] §5.1 and Table 2 (Performance Results): The reported 99% improvement on AIME and 41% on AMC are presented without rollout counts per problem, number of independent training runs, standard deviations, or statistical significance tests. In the absence of these controls it is impossible to rule out that the gains arise from variance, implicit narrowing of the rollout distribution, or overfitting to the seed-set distribution rather than genuine strategy transfer.
  3. [§4.1] §4.1 (Template-Guided Policy Optimization): The mechanism by which templates are injected into the RL objective (e.g., as an auxiliary loss, constrained sampling, or modified reward) is not formalized with equations or pseudocode. Without this formalization it is difficult to assess whether the method avoids the exploration-collapse risk highlighted in the stress-test note when templates are applied to held-out problems whose distribution differs from the seed set.
minor comments (2)
  1. [Abstract, §5.3] The abstract and §5.3 claim 'remarkable cross-domain generalization,' yet the experimental section does not explicitly list the held-out domains or quantify distribution shift between seed and test sets.
  2. [§3.3] Notation for the template library and its update rule is introduced without a clear table or diagram; a small illustrative example of a template before and after an online edit would improve clarity.
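
Major comment 2 asks for results over independent runs with standard deviations and a significance test. As a minimal sketch of what such a check involves, the function below computes a paired t statistic from matched scores (for example, the same random seed or the same benchmark under both methods) using only the Python standard library. The function name is hypothetical, and no values from the paper are used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic for paired samples a[i], b[i] (same seed or same benchmark)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

The statistic is compared against a t distribution with `len(a) - 1` degrees of freedom to obtain a p-value; SciPy's `scipy.stats.ttest_rel` performs both steps if that dependency is available. With only a handful of runs or benchmarks, the test has little power, so reporting the per-run scores alongside it is worthwhile.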

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional empirical controls.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Template Library Construction): The MCTS procedure on the seed set is described at a high level, but the reward formulation, template extraction criteria, and exact representation of templates (e.g., as sequences of reasoning steps or constraints) are not specified. This is load-bearing for the claim that the templates encode transferable high-level strategies rather than seed-specific patterns.

    Authors: We agree that the current description in §3.2 is high-level and would benefit from greater precision. In the revised manuscript we will expand this section to specify the MCTS reward (a combination of final-answer correctness and per-step verification), the extraction criteria (paths with success rate above a threshold and low variance on semantically similar problems), and the template representation (ordered sequences of high-level reasoning steps together with explicit constraint predicates). These additions will make explicit how the templates capture reusable strategic patterns rather than seed-specific artifacts.
    Revision: yes

  2. Referee: [§5.1, Table 2] §5.1 and Table 2 (Performance Results): The reported 99% improvement on AIME and 41% on AMC are presented without rollout counts per problem, number of independent training runs, standard deviations, or statistical significance tests. In the absence of these controls it is impossible to rule out that the gains arise from variance, implicit narrowing of the rollout distribution, or overfitting to the seed-set distribution rather than genuine strategy transfer.

    Authors: We accept that the current presentation lacks the statistical safeguards needed to support the reported gains. We will revise §5.1 and Table 2 to report rollout counts per problem (16 for AIME, 8 for AMC), results aggregated over five independent training runs with distinct random seeds, standard deviations, and p-values from paired t-tests against GRPO. These additions will allow readers to assess whether the improvements exceed what could be explained by variance or seed-set overfitting.
    Revision: yes

  3. Referee: [§4.1] §4.1 (Template-Guided Policy Optimization): The mechanism by which templates are injected into the RL objective (e.g., as an auxiliary loss, constrained sampling, or modified reward) is not formalized with equations or pseudocode. Without this formalization it is difficult to assess whether the method avoids the exploration-collapse risk highlighted in the stress-test note when templates are applied to held-out problems whose distribution differs from the seed set.

    Authors: We concur that an explicit formalization is required. The revised §4.1 will include the precise objective (the original GRPO loss augmented by a KL term that penalizes deviation from the template-constrained sampling distribution) together with pseudocode in the appendix. On the exploration-collapse concern, we note that our cross-domain experiments already apply seed-derived templates to out-of-distribution problems and observe no collapse; the online editability of the template library further permits adaptation. We will add a short discussion of this mitigation in the revised text.
    Revision: yes
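
For orientation, one possible reading of the objective described in the third response is written out below. It is a sketch under stated assumptions, not the paper's formula: the weight $\lambda$, the symbol $\pi_{\mathrm{tmpl}}$ for the template-constrained sampling distribution, the retrieved template $\tau$, and the direction of the KL divergence are all choices made here for illustration, and the rebuttal itself is simulated.

```latex
\[
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{GRPO}}(\theta)
  + \lambda \, \mathbb{E}_{q}\!\left[
      D_{\mathrm{KL}}\!\left(
        \pi_{\theta}(\cdot \mid q)
        \,\middle\|\,
        \pi_{\mathrm{tmpl}}(\cdot \mid q, \tau)
      \right)
    \right]
\]
```

Written this way, the exploration-collapse question raised by the referee becomes a question about $\lambda$: a large weight ties the policy closely to the template distribution, while $\lambda = 0$ recovers plain GRPO.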

Circularity Check

0 steps flagged

No significant circularity; empirical method builds on external components

full rationale

The paper describes a method that first extracts templates via MCTS on a small seed set and then incorporates the resulting library as guidance within standard RL policy optimization (e.g., compared to GRPO). Reported gains on AIME and AMC are presented as experimental outcomes from benchmark evaluation rather than quantities derived by construction from the same fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown reducing the central claims to the inputs; the template construction and RL integration remain independent steps whose validity can be checked against held-out data and external baselines. This keeps the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate concrete free parameters or invented entities; the central claim rests on the untested transferability of MCTS templates.

axioms (1)
  • Domain assumption: templates extracted via MCTS on a small seed set capture generalizable problem-solving strategies.
    Invoked when claiming cross-domain generalization and reduced ineffective exploration.

pith-pipeline@v0.9.0 · 5753 in / 1187 out tokens · 49851 ms · 2026-05-22T13:35:52.411804+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

    cs.LG 2026-05 conditional novelty 6.0

    DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

  2. SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.

  3. EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

    cs.LG 2025-08 unverdicted novelty 6.0

    EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.

  4. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
