pith. machine review for the scientific record.

arxiv: 2604.06159 · v1 · submitted 2026-04-07 · 💻 cs.LG

Recognition: no theorem link

Target Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords target policy optimization · reinforcement learning · policy gradients · sparse rewards · cross-entropy loss · large language models · RLVR

The pith

Target Policy Optimization separates the question of which completions to favor from the question of how the parameters should move to realize that change, by building an explicit target distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In reinforcement learning with scored completions, standard policy gradient methods answer both which outputs deserve higher probability and how to adjust the model in a single update step, making them sensitive to learning rate and clipping choices. Target Policy Optimization decouples these by first constructing a target distribution q proportional to the old policy probability times the exponential of the utility score for each completion. It then trains the current policy to match this target using cross-entropy loss. The resulting gradient on the sampled logits simplifies to the difference between current policy probabilities and the target, which reaches zero once alignment occurs. Tests on bandits, sequence tasks, and large language models show TPO matches existing methods on easy problems but delivers stronger results when rewards are sparse.

Core claim

Given scored completions sampled from an old policy, TPO constructs a target distribution q_i ∝ p_i^old exp(u_i) and fits the current policy to q via cross-entropy minimization. This produces a loss gradient of p^θ − q on the sampled logits that vanishes automatically once the policy matches the target, thereby separating the choice of favored completions from the mechanics of the parameter update.
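
A minimal sketch of the target construction in the core claim, assuming group-renormalized old-policy probabilities and externally supplied utility scores (names are illustrative, not taken from the paper's code):

```python
import numpy as np

def tpo_target(p_old, u):
    """Construct the TPO target q_i ∝ p_i^old · exp(u_i) over one group.

    p_old: old-policy probabilities of the K sampled completions,
           renormalized so they sum to one within the group.
    u: utility scores for the same completions, treated as fixed inputs.
    """
    w = np.asarray(p_old, dtype=float) * np.exp(np.asarray(u, dtype=float))
    return w / w.sum()  # group-level normalization so q sums to one

# Toy check: a uniform old policy over K=4 with one rewarded completion.
q = tpo_target([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0])
# q shifts mass toward the rewarded completion while retaining old-policy
# support; fitting the policy to q by cross-entropy then yields the
# p^theta - q gradient described above.
```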

What carries the argument

The target distribution q_i ∝ p_i^old exp(u_i) together with cross-entropy fitting of the policy to q, yielding the gradient p^θ − q.

If this is right

  • The update direction becomes independent of specific learning-rate or clipping choices that normally control overshoot.
  • Performance gains appear specifically under sparse reward conditions across tabular, sequence, and billion-parameter settings.
  • The loss gradient reaches zero naturally once the policy probabilities align with the constructed target (a quick numerical check follows this list).
  • The same construction and fitting procedure works without modification from small bandits to large language model RLVR.
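
A quick numerical check of the third bullet, assuming only the softmax cross-entropy form given in the core claim (illustrative code, not from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, q):
    # L = -sum_i q_i log p_i with p = softmax(z) and q a fixed target
    return -np.sum(q * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)
q = softmax(rng.normal(size=5))  # any fixed target summing to one

# Finite-difference gradient of the loss matches p - q ...
eps = 1e-6
num_grad = np.array([
    (ce_loss(z + eps * np.eye(5)[j], q) - ce_loss(z - eps * np.eye(5)[j], q)) / (2 * eps)
    for j in range(5)
])
assert np.allclose(num_grad, softmax(z) - q, atol=1e-5)

# ... and vanishes once the policy matches the target.
z_matched = np.log(q)  # logits whose softmax equals q exactly
assert np.allclose(softmax(z_matched) - q, 0.0, atol=1e-12)
```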

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit target may reduce the need for variance-reduction baselines or advantage normalization in other RL algorithms.
  • Because the target mixes old policy mass with utility, TPO could be adapted to incorporate offline data or human preferences more directly.
  • In generative modeling, repeated application might produce policies whose output distributions more closely track desired utility landscapes without iterative clipping.

Load-bearing premise

That constructing the target from the old policy and utilities and then fitting via cross-entropy produces more stable or higher-performing updates than standard policy-gradient methods, particularly when rewards are sparse.

What would settle it

A controlled experiment in which TPO fails to match or exceed the performance of PG, PPO, GRPO, or DG on a sparse-reward task while the constructed target remains well-defined.

Figures

Figures reproduced from arXiv: 2604.06159 by Jean Kaddour.

Figure 1
Figure 1: TPO matches baselines on easy tasks and outperforms them under sparse reward. (a) On an MNIST contextual bandit with dense reward, TPO converges slightly faster than GRPO and DG. (b) On a sparse-reward token-reversal task (reward only at the end of the sequence), GRPO and DG stall near random while TPO solves the task. Both panels show mean ± s.e. over 20 seeds.
Figure 2
Figure 2: Implementation sketch. log_scores contains the policy log-probabilities of the sampled candidates, renormalized by log_softmax to form the policy over the group; u contains standardized task scores; eta is an optional temperature with default value 1. The sketch shows the simplest on-policy implementation, where the same log_scores tensor is used both to form q and to compute log_p, with q detached from the computation graph.
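
A hedged reconstruction of the on-policy loss this caption describes, in PyTorch; the names log_scores, u, and eta follow the caption, while the exact temperature placement and everything around the loss are our assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def tpo_loss(log_scores: torch.Tensor, u: torch.Tensor, eta: float = 1.0) -> torch.Tensor:
    """Simplest on-policy TPO loss for one group of K sampled candidates.

    log_scores: policy log-probabilities of the candidates, shape (K,).
    u: standardized task scores, shape (K,).
    eta: optional temperature, default 1 (we assume it divides u).
    """
    # Renormalize over the group to form the policy on the sampled candidates.
    log_p = F.log_softmax(log_scores, dim=-1)
    # On-policy, the same tensor forms the target q ∝ p_old · exp(u / eta);
    # detaching stops gradients from flowing through the target.
    q = F.softmax(log_p.detach() + u / eta, dim=-1)
    # Cross-entropy of the policy against the fixed target; the gradient
    # with respect to the logits is exactly p^theta - q.
    return -(q * log_p).sum()
```

Autograd through this loss reproduces the p^θ − q gradient on the sampled logits, so the update self-extinguishes once the policy matches the target.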
Figure 3
Figure 3: Single-context symmetric bandit (K=100, B=100, normalized steps). (a) TPO and DG converge fastest; GRPO and PG plateau at higher error. (b) TPO maintains the lowest misalignment to the oracle gradient throughout training.
Figure 4
Figure 4: Multi-context bandit (N=100, K=10, exact gradients). (a) All methods converge; the CE oracle is fastest. (b) TPO achieves near-zero misalignment to the CE oracle direction, confirming that its update direction targets the optimal allocation.
Figure 5
Figure 5: MNIST contextual bandit: TPO converges fastest and reaches the lowest error. (a) Learning curves for all single-sample bandit updates, including the same-signal ablation Group PG. (b) At step 2,000, for each misclassified example, how much more each method increases the true-class probability p_y compared to a generic one-vs-rest baseline (Appendix C), binned by wrong-class concentration c.
Figure 6
Figure 6: Token Reversal (bag-of-tokens reward, K=8 token candidates). All methods use B=100 prompts and follow one behavior trajectory each; TPO_token and GRPO_token additionally sample K next-token candidates at each prefix state. Columns vary vocabulary size V ∈ {2, 4, 8, 16}.
Figure 7
Figure 7: Task variations, prompt- and interaction-matched. Top two rows: prompt-matched. Bottom two rows: interaction-matched. Within each pair, the first row is bag-of-tokens reward and the second is sequential reward. Columns vary target logic.
Figure 8
Figure 8: Terminal reward, prompt- and interaction-matched. Top row: prompt-matched (B=100 for all methods). Bottom row: interaction-matched (B·K=800 rollouts per step, with single-sample batch size and learning rate scaled by K and √K respectively). Here grouped methods use K=8 candidates per prompt. Y-axis: exact-match error. TPO has the lowest error at each H under both matching conditions.
Figure 9
Figure 9: Removing the anchor, KL penalty, or target matching each degrades learning. Terminal reward, reverse-copy targets, V=2, K=8, B=100, 20 seeds. Shading shows ±1 s.e.
Figure 10
Figure 10: LLM RLVR. Top row: Qwen3-1.7B. Bottom row: DeepSeek-R1-Distill-Qwen-1.5B. All runs use K=16 rollouts per prompt. Columns: GSM8K (held-out test accuracy, evaluated every 5 steps), Reasoning Gym graph coloring (train mean score), Reasoning Gym Knights & Knaves (train mean score).
Figure 11
Figure 11: TPO's gradient self-extinguishes; GRPO's does not (H=8, V=2, K=32). (a) Gradient L2 norms over training. (b) Per-candidate weight proxy on successful (solid) vs. failed (dashed) candidates: mean target mass q_i for TPO, mean |A_i| for GRPO.
Figure 12
Figure 12: Most groups carry no signal early in training; TPO eliminates them fastest (H=8, V=2, K=32, B=256, 2,000 episodes). (a) Fraction of groups where all K candidates fail. (b) Fraction of prompts with at least one successful candidate.
Figure 13
Figure 13: Group-size sensitivity sweep (H=8, V=2, epochs=4). (a) TPO learning curves: steady improvement as K grows, with the strongest performance at K=64. (b) GRPO learning curves: larger groups help, but performance remains less stable and less monotonic. (c) Final error vs. K: TPO improves from 8.9% at K=4 to 0.36% at K=64; GRPO improves from 19.4% at K=4 to 4.4% at K=32 and then worsens slightly at K=64 (5.6%).
Figure 14
Figure 14: Zero-variance masking (H=8, V=2, K=32, epochs=4). (a) Learning curves: GRPO (zv-masked) is substantially worse than both GRPO and TPO. (b) Final error: masking increases GRPO's error from 6.3% to 29.7%, while TPO reaches 0.05% without any masking. 30 seeds, shading/bars ±1 s.e.
Figure 15
Figure 15: Multi-epoch extraction (H=8, V=2, K=32). (a) Error curves: TPO with 4 gradient epochs reaches 0.2% error at episode 400 while TPO with 1 epoch is at 1.1%, roughly 5× faster. Both eventually converge to <0.1%. DG, limited to a single epoch, plateaus at 14%. (b) Gradient norms: TPO (4 ep) decays fastest; TPO (1 ep) shows a delayed spike and slower decay; DG's gradient stays low but persistent.
Figure 16
Figure 16: Epoch-count ablation (H=8, V=2, K=32). (a) TPO learning curves across epoch counts: all converge smoothly and remain low-error throughout. (b) GRPO learning curves: 2 epochs is the worst setting, while 8 and 16 epochs recover strongly. (c) Final error comparison: TPO stays below 2.3% everywhere; GRPO is strongly non-monotonic (37.6% at 2 epochs, 1.1% at 16 epochs). 30 seeds, shading ±1 s.e.
Figure 17
Figure 17: TPO temperature ablation. All values in [0.25, 2] converge within 141 episodes; only η=4 is meaningfully slower. Performance is robust across a 16× range.
Figure 18
Figure 18: DG epoch sensitivity across sparse- and dense-reward transformer tasks. (a) Reverse-copy transformer RLVR with terminal reward, 20 seeds: reusing each rollout batch for 4 DG gradient epochs keeps the error high (48.3% final) while the standard 1-epoch DG update reaches 2.0%. (b, c) Final error on the eight prompt-matched token-reversal variants from Section 3.5 (H=10, V=2, K=8 token candidates, 10 seeds).
Original abstract

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, a circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Target Policy Optimization (TPO) for reinforcement learning. Given a prompt and a group of scored completions, TPO constructs a target distribution q_i ∝ p_i^old exp(u_i) from the previous policy and utilities, then fits the current policy to q via cross-entropy loss. The resulting gradient on the sampled-completion logits is p^θ - q and vanishes when the policy matches the target. Empirical evaluations on tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR show TPO matching PG, PPO, GRPO, and DG on easy tasks while substantially outperforming them under sparse rewards. Code is provided for reproducibility.

Significance. If the results hold, TPO offers a clean separation between target construction and parameter updates that yields a well-defined, vanishing gradient at match. This could improve stability and performance in sparse-reward settings common to LLM alignment and RLVR. The open-source code is a clear strength that supports verification and extension.

major comments (2)
  1. [Abstract] The claim of substantial outperformance under sparse reward is presented without quantitative results, number of runs, error bars, or statistical tests, which are load-bearing for assessing whether the observed gains are reliable or merely anecdotal.
  2. [Method] The gradient statement p^θ − q on sampled-completion logits assumes a softmax over the group and no additional normalization; the manuscript should derive this explicitly (including how group normalization interacts with the proportionality in q) to confirm it does not introduce new instabilities.
minor comments (2)
  1. [Abstract] Notation: p_i^old is written with a superscript in one place and subscript in another; consistent subscript notation would improve readability.
  2. [Introduction] The manuscript should add a short paragraph contrasting TPO with KL-regularized methods to clarify whether the target q implicitly encodes a similar regularizer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will incorporate the suggested changes to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim of substantial outperformance under sparse reward is presented without quantitative results, number of runs, error bars, or statistical tests, which are load-bearing for assessing whether the observed gains are reliable or merely anecdotal.

    Authors: We agree that the abstract would benefit from quantitative support for the sparse-reward claim. In the revised version we will add specific metrics drawn from the LLM RLVR experiments (e.g., mean reward improvement and standard deviation over the reported number of runs) while keeping the abstract concise. revision: yes

  2. Referee: [Method] The gradient statement p^θ − q on sampled-completion logits assumes a softmax over the group and no additional normalization; the manuscript should derive this explicitly (including how group normalization interacts with the proportionality in q) to confirm it does not introduce new instabilities.

    Authors: We will insert a short derivation in the Method section. The cross-entropy loss is L = −∑_i q_i log p^θ_i with p^θ the softmax over the group; its gradient w.r.t. the logits is exactly p^θ − q. Because q is already normalized to sum to one within the same group (q_i ∝ p_i^old exp(u_i) followed by group-level normalization), the proportionality does not introduce extra scaling factors or instabilities beyond ordinary cross-entropy training. The added paragraph will contain the full derivation and a brief stability remark. revision: yes
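
For reference, the promised derivation is a single application of standard softmax cross-entropy calculus. With group logits z, p^θ = softmax(z), and q held fixed:

```latex
\mathcal{L}(\theta) = -\sum_i q_i \log p_i^{\theta},
\qquad
\frac{\partial \mathcal{L}}{\partial z_j}
  = -\sum_i q_i \left(\delta_{ij} - p_j^{\theta}\right)
  = p_j^{\theta}\underbrace{\sum_i q_i}_{=1} - q_j
  = p_j^{\theta} - q_j,
```

which is zero exactly when p^θ = q on the sampled group, consistent with the vanishing-gradient claim in the abstract.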

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The paper's core construction defines the target q_i explicitly and independently as proportional to the previous policy p_old multiplied by exp(u_i) from external scores, then applies a standard cross-entropy objective whose gradient p^θ - q vanishes at equality by the algebraic property of the loss itself. This separation does not redefine any quantity in terms of the current parameters, invoke self-citations for uniqueness or ansatz justification, or rename a fitted input as a prediction; the update rule follows directly from the chosen objective without reducing the claimed stability advantage to a tautology. Empirical results on bandits, sequence tasks, and LLMs are reported as observed performance rather than derived necessities, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard RL assumptions that completions can be sampled from the current policy and that external utility scores u_i are available and fixed. No new free parameters, axioms beyond domain assumptions, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Utility scores u_i for sampled completions are provided by an external oracle or verifier and remain fixed during the update.
    The target construction step presupposes access to these scores; the abstract treats them as given inputs.


