FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

Feng Li; Jun Luo; Kai Han; Kai Wang; Pengyu Chen; Shaowei Li; Yunsheng Yuan

arxiv: 2606.03094 · v1 · pith:DI6LSVBKnew · submitted 2026-06-02 · 💻 cs.LG

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

Pengyu Chen , Shaowei Li , Kai Wang , Yunsheng Yuan , Kai Han , Jun Luo , Feng Li This is my paper

Pith reviewed 2026-06-28 11:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords federated learninggroup relative policy optimizationreinforcement learningnon-IID dataadaptive aggregationprivacy preservationreasoning models

0 comments

The pith

FGRPO decentralizes GRPO fine-tuning across data owners with adaptive aggregation on relative performance gains to converge on non-IID data while keeping raw data private.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents federated group relative policy optimization (FGRPO) as a way to fine-tune reasoning models on distributed, private datasets without moving the data to a central server. It adds an adaptive aggregation step that scores each client's progress against that client's own past results rather than a global scale. This step is meant to keep training stable when reward magnitudes and task difficulties differ across owners. A reader would care if the method lets multiple parties improve long-chain reasoning models collaboratively without exposing their data.

Core claim

FGRPO is a framework that decentralizes the fine-tuning of reasoning models across heterogeneous data owners by incorporating an adaptive aggregation mechanism based on relative performance gain, where each client's improvement is characterized relative to its personalized historical baseline, thereby dynamically prioritizing effective learning trajectories and ensuring robust convergence on non-IID data while preserving data privacy.

What carries the argument

The adaptive aggregation mechanism based on relative performance gain, which measures each client's improvement against its own historical baseline to prioritize learning trajectories and reduce instability from divergent reward scales.

If this is right

Fine-tuning of reasoning models can proceed across separate owners without centralizing raw data.
Training stability holds when local datasets are non-IID and reward scales vary.
Only aggregated model updates need to be exchanged, satisfying privacy constraints.
Clients with easier local tasks do not dominate the global update because gains are normalized to each client's history.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same relative-gain idea could be tested in other federated reinforcement-learning algorithms that lack a critic.
If the baseline update rule proves sensitive to the length of history kept per client, longer histories might improve robustness on very heterogeneous data.
Real-world deployment would require checking whether communication cost of the extra baseline statistics remains acceptable at scale.

Load-bearing premise

That scoring each client against its own past performance will reliably select useful updates and stabilize training even when tasks differ sharply in difficulty and reward magnitude.

What would settle it

A side-by-side run of standard federated GRPO versus FGRPO on the same non-IID collection of tasks with mismatched reward scales, where the version without relative-gain aggregation diverges while FGRPO converges.

Figures

Figures reproduced from arXiv: 2606.03094 by Feng Li, Jun Luo, Kai Han, Kai Wang, Pengyu Chen, Shaowei Li, Yunsheng Yuan.

**Figure 2.** Figure 2: Convergence performance of different algorithms in terms of average reward with Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Test accuracy convergence trajectories of the different models (Qwen3-4B and Llama-3.2- [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

**Figure 4.** Figure 4: Average reward trajectories of the different models (Qwen3-4B and Llama-3.2-11B) on [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

**Figure 5.** Figure 5: Test accuracy under different numbers of clients. [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗

**Figure 6.** Figure 6: Average reward trajectories under varying numbers of clients on OpenR1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: Average reward trajectories under varying numbers of clients on GEOQA datasets. [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Test accuracy under varying data heterogeneity on OpenR1 and GEOQA datasets. [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

**Figure 9.** Figure 9: Average reward trajectories of different algorithms under varying non-IID data settings on [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Average reward trajectories of different algorithms under varying non-IID data settings on [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Average reward trajectories of different federated RLVR algorithms on OpenR1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Average reward trajectories of different federated RLVR algorithms on GEOQA dataset. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Average GPU and memory utilization across different models. [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗

read the original abstract

Recent advances in language models have established reinforcement learning as the primary paradigm for eliciting self-correction and long-chain reasoning. While group relative policy optimization (GRPO) offers superior scalability by eliminating the critic network, deploying it on a central infrastructure entails collecting a large volume of data from distributed owners, which poses significant privacy risks. To address these concerns, we introduce federated GRPO (FGRPO), a framework designed to decentralize the fine-tuning of reasoning models across heterogeneous data owners. To effectively mitigate the instability caused by divergent reward scales across heterogeneous tasks, FGRPO incorporates an adaptive aggregation mechanism based on relative performance gain. By characterizing each client's improvement relative to its personalized historical baseline, the framework dynamically prioritizes effective learning trajectories regardless of local task difficulty. FGRPO ensures robust convergence on non-IID data while preserving data privacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FGRPO proposes a federated GRPO variant with relative-gain adaptive aggregation but supplies no experiments, equations, or derivations to support its convergence claims.

read the letter

The paper's main contribution is FGRPO, which takes GRPO and moves it into a federated setting with an adaptive aggregation step that weights clients by their improvement over a personalized historical baseline. This targets the privacy problem of centralizing data for reasoning-model RL and the instability from non-IID reward scales.

The high-level idea is sensible. Normalizing updates by relative gain rather than raw reward is a straightforward way to try to equalize across heterogeneous tasks, and the privacy motivation is real for distributed owners of reasoning data.

The problem is that none of the central claims are backed by anything. The abstract asserts robust convergence on non-IID data but shows no experiments, no error bars, no ablation of the aggregation rule, and no equations defining the baseline or the weighting function. Without those details it is impossible to judge whether the mechanism actually stabilizes training or simply introduces a new source of circularity. The stress-test note correctly flags that we cannot yet isolate a concrete inconsistency because the manuscript text is not visible here, but the absence of any supporting evidence in the provided description is itself the issue.

This is an early-stage proposal aimed at the intersection of federated learning and RL for language models. Readers already working on privacy-preserving distributed training might pick up the relative-gain idea as a prompt for their own experiments. It does not yet contain enough to cite, reproduce, or build on directly.

I would not send this to peer review in its current form. It needs at least preliminary results or a clear formal argument before referee time is warranted.

Referee Report

3 major / 0 minor

Summary. The paper proposes Federated GRPO (FGRPO), extending Group Relative Policy Optimization to a federated setting for decentralizing RL fine-tuning of reasoning models across distributed, privacy-sensitive data owners. It introduces an adaptive aggregation rule that weights client updates according to each client's relative performance gain against a personalized historical baseline, with the goal of stabilizing training under non-IID data and heterogeneous reward scales. The central claim is that this mechanism ensures robust convergence while preserving data privacy.

Significance. If the adaptive aggregation rule can be shown to be well-defined and empirically effective, the work would address a practically important gap between scalable RL methods such as GRPO and the privacy constraints of distributed model owners. No machine-checked proofs, reproducible code, or parameter-free derivations are presented, so the significance rests entirely on whether the empirical and theoretical support supplied in the full manuscript substantiates the convergence claim.

major comments (3)

[Abstract] Abstract: the assertion that FGRPO 'ensures robust convergence on non-IID data' is presented without any experimental results, learning curves, ablation studies, or statistical tests. Because the central claim is empirical, the absence of supporting evidence is load-bearing.
[Abstract] Abstract: the adaptive aggregation is defined in terms of 'relative performance gain' computed against a 'personalized historical baseline,' yet no equation, algorithm, or pseudocode is supplied for either quantity. Without these definitions it is impossible to determine whether the weighting rule is independent of the data or reduces to a post-hoc fit, directly affecting the soundness of the non-IID stability claim.
No section or equation is referenced for the convergence argument. The manuscript must supply either a formal convergence proof under standard federated assumptions or a reproducible experimental protocol with error bars before the robustness claim can be evaluated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below and indicate the revisions that will be incorporated.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that FGRPO 'ensures robust convergence on non-IID data' is presented without any experimental results, learning curves, ablation studies, or statistical tests. Because the central claim is empirical, the absence of supporting evidence is load-bearing.

Authors: We agree that the abstract states an empirical claim without referencing supporting evidence. The revised manuscript will qualify the claim in the abstract or add an explicit reference to the experimental results (including learning curves, ablations, and statistical tests) presented in Section 4. revision: yes
Referee: [Abstract] Abstract: the adaptive aggregation is defined in terms of 'relative performance gain' computed against a 'personalized historical baseline,' yet no equation, algorithm, or pseudocode is supplied for either quantity. Without these definitions it is impossible to determine whether the weighting rule is independent of the data or reduces to a post-hoc fit, directly affecting the soundness of the non-IID stability claim.

Authors: We acknowledge that the abstract does not include the formal definitions. The revised version will add the equations for relative performance gain (computed against the per-client moving-average baseline) and the resulting aggregation weights, along with the corresponding algorithm pseudocode, either in the abstract (if space allows) or with a clear pointer from the abstract to Section 3. revision: yes
Referee: [—] No section or equation is referenced for the convergence argument. The manuscript must supply either a formal convergence proof under standard federated assumptions or a reproducible experimental protocol with error bars before the robustness claim can be evaluated.

Authors: We agree that the current text does not reference a convergence argument. The revised manuscript will either include a proof sketch under standard federated assumptions (bounded heterogeneity, Lipschitz rewards) or expand the experimental protocol description in Section 4 to include multiple independent runs with error bars, ensuring the robustness claim is properly substantiated. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description introduce FGRPO and its adaptive aggregation mechanism conceptually, characterizing client improvements relative to personalized historical baselines to handle non-IID data. No equations, self-citations, or derivation steps are present that reduce any claim to its own inputs by construction (e.g., no fitted parameters renamed as predictions or ansatzes smuggled via citation). The framework is presented as a proposed method rather than a mathematical derivation, making it self-contained against external benchmarks with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review based on abstract only; full details of parameters, assumptions, and methods unavailable.

invented entities (1)

relative performance gain no independent evidence
purpose: to mitigate instability caused by divergent reward scales and dynamically prioritize learning trajectories
Introduced in the abstract as the core of the adaptive aggregation mechanism.

pith-pipeline@v0.9.1-grok · 5686 in / 1118 out tokens · 28615 ms · 2026-06-28T11:42:46.793476+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 14 canonical work pages · 9 internal anchors

[1]

Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. InProc. of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 12248–12267, 2024

2024
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

R1-V: Reinforcing Super Gen- eralization Ability in Vision-Language Models with Less Than $3

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-V: Reinforcing Super Gen- eralization Ability in Vision-Language Models with Less Than $3. https://github.com/ Deep-Agent/R1-V, 2025. Accessed: 2025-02-02

2025
[5]

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346, 2025a

Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization.arXiv preprint arXiv:2505.12346, 2025

work page arXiv 2025
[6]

Open-R1-Multimodal: A Fork to Add Multimodal Model Training to Open-R1.https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025

EvolvingLMMs-Lab. Open-R1-Multimodal: A Fork to Add Multimodal Model Training to Open-R1.https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025

2025
[7]

Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee

Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang Low. Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee. InProc. of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 1007–1021, 2021

2021
[8]

FedRLHF: A Convergence-Guaranteed Federated Framework for Privacy-Preserving and Per- sonalized RLHF

Flint Xiaofeng Fan, Cheston Tan, Yew-Soon Ong, Roger Wattenhofer, and Wei Tsang Ooi. FedRLHF: A Convergence-Guaranteed Federated Framework for Privacy-Preserving and Per- sonalized RLHF. InProc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 713–721, 2025

2025
[9]

Provably Robust Federated Rein- forcement Learning

Minghong Fang, Xilong Wang, and Neil Zhenqiang Gong. Provably Robust Federated Rein- forcement Learning. InProc. of the 2025 ACM on Web Conference (WWW), pages 896–909, 2025

2025
[10]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning.Nature, 645(8081):633–638, 2025

2025
[12]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. of the 10th International Conference on Learning Representations (ICLR), 2022

2022
[13]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 System Card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

FedHPD: Heterogeneous Federated Reinforcement Learning via Policy Distillation

Wenzheng Jiang, Ji Wang, Xiongtao Zhang, Weidong Bao, Cheston Tan, and Flint Xiaofeng Fan. FedHPD: Heterogeneous Federated Reinforcement Learning via Policy Distillation. In Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 2568–2570, 2025

2025
[15]

Federated Reinforcement Learning with Environment Heterogeneity

Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated Reinforcement Learning with Environment Heterogeneity. InProc. of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 18–37, 2022

2022
[16]

Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence

Philip Jordan, Florian Grötschla, Flint Xiaofeng Fan, and Roger Wattenhofer. Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence. In Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 964–972, 2024. 10

2024
[17]

Reddi, Sebastian U

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. InProc. of the 37th International Conference on Machine Learning (ICML), pages 5132–5143, 2020

2020
[18]

Federated Reinforce- ment Learning: Linear Speedup Under Markovian Sampling

Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated Reinforce- ment Learning: Linear Speedup Under Markovian Sampling. InProc. of the 39th International Conference on Machine Learning (ICML), pages 10997–11057, 2022

2022
[19]

Konda and John N

Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. InAdvances in Neural Information Processing Systems (NIPS), pages 1008–1014, 1999

1999
[20]

Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis

Guangchen Lan, Dong-Jun Han, Abolfazl Hashemi, Vaneet Aggarwal, and Christopher Brinton. Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis. InProc. of the 13th International Conference on Learning Representations (ICLR), 2025

2025
[21]

Federated Optimization in Heterogeneous Networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated Optimization in Heterogeneous Networks. InProc. of the 3rd Conference on Machine Learning and Systems (MLSys), 2020

2020
[22]

CPPO: Accelerating the Train- ing of Group Relative Policy Optimization-Based Reasoning Models.arXiv preprint arXiv:2503.22342, 2025

Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the Train- ing of Group Relative Policy Optimization-Based Reasoning Models.arXiv preprint arXiv:2503.22342, 2025

work page arXiv 2025
[23]

Communication-Efficient Learning of Deep Networks from Decentralized Data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. InProc. of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, 2017

2017
[24]

On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence.arXiv preprint arXiv:2508.02833, 2025

Lei Pang and Ruinan Jin. On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence.arXiv preprint arXiv:2508.02833, 2025

work page arXiv 2025
[25]

Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and Hugh Brendan McMahan

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive Federated Optimization. InProc. of the 9th International Conference on Learning Representations (ICLR), 2021

2021
[26]

Federated Ensemble-Directed Offline Reinforcement Learning

Desik Rengarajan, Nitin Ragothaman, Dileep Kalathil, and Srinivas Shakkottai. Federated Ensemble-Directed Offline Reinforcement Learning. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[27]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous Environments

Han Wang, Sihong He, Zhili Zhang, Fei Miao, and James Anderson. Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous Environments. InProc. of the 41st International Conference on Machine Learning (ICML), 2024

2024
[30]

The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond

Jiin Woo, Gauri Joshi, and Yuejie Chi. The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond. InProc. of the 40th International Conference on Machine Learning (ICML), pages 37157–37216, 2023

2023
[31]

Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices

Jiin Woo, Laixi Shi, Gauri Joshi, and Yuejie Chi. Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices. InProc. of the 41st International Conference on Machine Learning (ICML), 2024

2024
[32]

The Actor-Critic Update Order Matters for PPO in Federated Reinforcement Learning.arXiv preprint arXiv:2506.01261, 2025

Zhijie Xie and Shenghui Song. The Actor-Critic Update Order Matters for PPO in Federated Reinforcement Learning.arXiv preprint arXiv:2506.01261, 2025

work page arXiv 2025
[33]

On the Linear Speedup of Person- alized Federated Reinforcement Learning with Shared Representations

Guojun Xiong, Shufan Wang, Daniel Jiang, and Jian Li. On the Linear Speedup of Person- alized Federated Reinforcement Learning with Shared Representations. InProc. of the 13th International Conference on Learning Representations (ICLR), 2025

2025
[34]

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Tong Yang, Shicong Cen, Yuting Wei, Yuxin Chen, and Yuejie Chi. Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024. 11

2024
[35]

On Classes of Summable Functions and Their Fourier Series.Proc

William Henry Young. On Classes of Summable Functions and Their Fourier Series.Proc. of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 87(594):225–229, 1912

1912
[36]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Finite-Time Analysis of On- Policy Heterogeneous Federated Reinforcement Learning

Chenyu Zhang, Han Wang, Aritra Mitra, and James Anderson. Finite-Time Analysis of On- Policy Heterogeneous Federated Reinforcement Learning. InProc. of the 12th International Conference on Learning Representations (ICLR), 2024

2024
[38]

GRPO-LEAD: A Difficulty-Aware Reinforcement Learn- ing Approach for Concise Mathematical Reasoning in Language Models.arXiv preprint arXiv:2504.09696, 2025

Jixiao Zhang and Chunsheng Zuo. GRPO-LEAD: A Difficulty-Aware Reinforcement Learn- ing Approach for Concise Mathematical Reasoning in Language Models.arXiv preprint arXiv:2504.09696, 2025

work page arXiv 2025
[39]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A Survey of Reinforcement Learning for Large Reasoning Models.arXiv preprint arXiv:2509.08827, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group Sequence Policy Optimization.arXiv preprint arXiv:2507.18071, 2025. 12 Contents 1 Introduction 1 2 System Model and Preliminaries 2 2.1 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Recent theoretical research has focused on establishing rigorous convergence guarantees under the unique constraints of sequential decision-making

across diverse clients to enhance sample efficiency without compromising raw data or trajectory privacy. Recent theoretical research has focused on establishing rigorous convergence guarantees under the unique constraints of sequential decision-making. [37] provides a fundamental finite-time analysis of on-policy FedRL under data heterogeneity, while [18]...
[42]

collaborative single-policy coverage

that a “collaborative single-policy coverage” condition, where the union of client data covers the optimal policy, is sufficient for global optimality. Furthermore, [ 33] highlights that shared representation learning can further accelerate convergence by extracting collaborative features across diverse tasks, while [34] proposes federated natural policy ...
[43]

Similarly, DAPO [36] introduces a decoupled and dynamic sampling system designed to stabilize long Chain-of-Thought (CoT) reasoning

addresses the issues of verbosity and sparsity by integrating length-regularized rewards and difficulty-aware advantage reweighting, which ensures robust generalization on challenging problems. Similarly, DAPO [36] introduces a decoupled and dynamic sampling system designed to stabilize long Chain-of-Thought (CoT) reasoning. By employing asymmetric clippi...

[1] [1]

Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. InProc. of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 12248–12267, 2024

2024

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

R1-V: Reinforcing Super Gen- eralization Ability in Vision-Language Models with Less Than $3

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-V: Reinforcing Super Gen- eralization Ability in Vision-Language Models with Less Than $3. https://github.com/ Deep-Agent/R1-V, 2025. Accessed: 2025-02-02

2025

[5] [5]

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346, 2025a

Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization.arXiv preprint arXiv:2505.12346, 2025

work page arXiv 2025

[6] [6]

Open-R1-Multimodal: A Fork to Add Multimodal Model Training to Open-R1.https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025

EvolvingLMMs-Lab. Open-R1-Multimodal: A Fork to Add Multimodal Model Training to Open-R1.https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025

2025

[7] [7]

Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee

Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang Low. Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee. InProc. of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 1007–1021, 2021

2021

[8] [8]

FedRLHF: A Convergence-Guaranteed Federated Framework for Privacy-Preserving and Per- sonalized RLHF

Flint Xiaofeng Fan, Cheston Tan, Yew-Soon Ong, Roger Wattenhofer, and Wei Tsang Ooi. FedRLHF: A Convergence-Guaranteed Federated Framework for Privacy-Preserving and Per- sonalized RLHF. InProc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 713–721, 2025

2025

[9] [9]

Provably Robust Federated Rein- forcement Learning

Minghong Fang, Xilong Wang, and Neil Zhenqiang Gong. Provably Robust Federated Rein- forcement Learning. InProc. of the 2025 ACM on Web Conference (WWW), pages 896–909, 2025

2025

[10] [10]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning.Nature, 645(8081):633–638, 2025

2025

[12] [12]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. of the 10th International Conference on Learning Representations (ICLR), 2022

2022

[13] [13]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 System Card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

FedHPD: Heterogeneous Federated Reinforcement Learning via Policy Distillation

Wenzheng Jiang, Ji Wang, Xiongtao Zhang, Weidong Bao, Cheston Tan, and Flint Xiaofeng Fan. FedHPD: Heterogeneous Federated Reinforcement Learning via Policy Distillation. In Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 2568–2570, 2025

2025

[15] [15]

Federated Reinforcement Learning with Environment Heterogeneity

Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated Reinforcement Learning with Environment Heterogeneity. InProc. of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 18–37, 2022

2022

[16] [16]

Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence

Philip Jordan, Florian Grötschla, Flint Xiaofeng Fan, and Roger Wattenhofer. Decentralized Federated Policy Gradient with Byzantine Fault-Tolerance and Provably Fast Convergence. In Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 964–972, 2024. 10

2024

[17] [17]

Reddi, Sebastian U

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. InProc. of the 37th International Conference on Machine Learning (ICML), pages 5132–5143, 2020

2020

[18] [18]

Federated Reinforce- ment Learning: Linear Speedup Under Markovian Sampling

Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated Reinforce- ment Learning: Linear Speedup Under Markovian Sampling. InProc. of the 39th International Conference on Machine Learning (ICML), pages 10997–11057, 2022

2022

[19] [19]

Konda and John N

Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. InAdvances in Neural Information Processing Systems (NIPS), pages 1008–1014, 1999

1999

[20] [20]

Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis

Guangchen Lan, Dong-Jun Han, Abolfazl Hashemi, Vaneet Aggarwal, and Christopher Brinton. Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis. InProc. of the 13th International Conference on Learning Representations (ICLR), 2025

2025

[21] [21]

Federated Optimization in Heterogeneous Networks

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated Optimization in Heterogeneous Networks. InProc. of the 3rd Conference on Machine Learning and Systems (MLSys), 2020

2020

[22] [22]

CPPO: Accelerating the Train- ing of Group Relative Policy Optimization-Based Reasoning Models.arXiv preprint arXiv:2503.22342, 2025

Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the Train- ing of Group Relative Policy Optimization-Based Reasoning Models.arXiv preprint arXiv:2503.22342, 2025

work page arXiv 2025

[23] [23]

Communication-Efficient Learning of Deep Networks from Decentralized Data

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. InProc. of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, 2017

2017

[24] [24]

On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence.arXiv preprint arXiv:2508.02833, 2025

Lei Pang and Ruinan Jin. On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence.arXiv preprint arXiv:2508.02833, 2025

work page arXiv 2025

[25] [25]

Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and Hugh Brendan McMahan

Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive Federated Optimization. InProc. of the 9th International Conference on Learning Representations (ICLR), 2021

2021

[26] [26]

Federated Ensemble-Directed Offline Reinforcement Learning

Desik Rengarajan, Nitin Ragothaman, Dileep Kalathil, and Srinivas Shakkottai. Federated Ensemble-Directed Offline Reinforcement Learning. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[27] [27]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [29]

Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous Environments

Han Wang, Sihong He, Zhili Zhang, Fei Miao, and James Anderson. Momentum for the Win: Collaborative Federated Reinforcement Learning across Heterogeneous Environments. InProc. of the 41st International Conference on Machine Learning (ICML), 2024

2024

[30] [30]

The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond

Jiin Woo, Gauri Joshi, and Yuejie Chi. The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond. InProc. of the 40th International Conference on Machine Learning (ICML), pages 37157–37216, 2023

2023

[31] [31]

Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices

Jiin Woo, Laixi Shi, Gauri Joshi, and Yuejie Chi. Federated Offline Reinforcement Learning: Collaborative Single-Policy Coverage Suffices. InProc. of the 41st International Conference on Machine Learning (ICML), 2024

2024

[32] [32]

The Actor-Critic Update Order Matters for PPO in Federated Reinforcement Learning.arXiv preprint arXiv:2506.01261, 2025

Zhijie Xie and Shenghui Song. The Actor-Critic Update Order Matters for PPO in Federated Reinforcement Learning.arXiv preprint arXiv:2506.01261, 2025

work page arXiv 2025

[33] [33]

On the Linear Speedup of Person- alized Federated Reinforcement Learning with Shared Representations

Guojun Xiong, Shufan Wang, Daniel Jiang, and Jian Li. On the Linear Speedup of Person- alized Federated Reinforcement Learning with Shared Representations. InProc. of the 13th International Conference on Learning Representations (ICLR), 2025

2025

[34] [34]

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Tong Yang, Shicong Cen, Yuting Wei, Yuxin Chen, and Yuejie Chi. Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning. InProc. of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024. 11

2024

[35] [35]

On Classes of Summable Functions and Their Fourier Series.Proc

William Henry Young. On Classes of Summable Functions and Their Fourier Series.Proc. of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 87(594):225–229, 1912

1912

[36] [36]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Finite-Time Analysis of On- Policy Heterogeneous Federated Reinforcement Learning

Chenyu Zhang, Han Wang, Aritra Mitra, and James Anderson. Finite-Time Analysis of On- Policy Heterogeneous Federated Reinforcement Learning. InProc. of the 12th International Conference on Learning Representations (ICLR), 2024

2024

[38] [38]

GRPO-LEAD: A Difficulty-Aware Reinforcement Learn- ing Approach for Concise Mathematical Reasoning in Language Models.arXiv preprint arXiv:2504.09696, 2025

Jixiao Zhang and Chunsheng Zuo. GRPO-LEAD: A Difficulty-Aware Reinforcement Learn- ing Approach for Concise Mathematical Reasoning in Language Models.arXiv preprint arXiv:2504.09696, 2025

work page arXiv 2025

[39] [39]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A Survey of Reinforcement Learning for Large Reasoning Models.arXiv preprint arXiv:2509.08827, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group Sequence Policy Optimization.arXiv preprint arXiv:2507.18071, 2025. 12 Contents 1 Introduction 1 2 System Model and Preliminaries 2 2.1 Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Recent theoretical research has focused on establishing rigorous convergence guarantees under the unique constraints of sequential decision-making

across diverse clients to enhance sample efficiency without compromising raw data or trajectory privacy. Recent theoretical research has focused on establishing rigorous convergence guarantees under the unique constraints of sequential decision-making. [37] provides a fundamental finite-time analysis of on-policy FedRL under data heterogeneity, while [18]...

[42] [42]

collaborative single-policy coverage

that a “collaborative single-policy coverage” condition, where the union of client data covers the optimal policy, is sufficient for global optimality. Furthermore, [ 33] highlights that shared representation learning can further accelerate convergence by extracting collaborative features across diverse tasks, while [34] proposes federated natural policy ...

[43] [43]

Similarly, DAPO [36] introduces a decoupled and dynamic sampling system designed to stabilize long Chain-of-Thought (CoT) reasoning

addresses the issues of verbosity and sparsity by integrating length-regularized rewards and difficulty-aware advantage reweighting, which ensures robust generalization on challenging problems. Similarly, DAPO [36] introduces a decoupled and dynamic sampling system designed to stabilize long Chain-of-Thought (CoT) reasoning. By employing asymmetric clippi...