pith. machine review for the scientific record.

arxiv: 2605.08378 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

Guangchen Lan

Pith reviewed 2026-05-12 01:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords learning · reinforcement · dissertation · intelligent systems · trustworthy · federated · optimization

The pith

Reinforcement learning provides a unifying framework for both scalable optimization in distributed settings and trustworthy behavior aligned with human preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The dissertation identifies two central challenges for reinforcement learning in intelligent systems: efficient scaling in federated environments with limited communication and heterogeneous computation, and ensuring policies align with human preferences while meeting safety requirements such as privacy-aware disclosure. It addresses these through four complementary contributions in federated optimization, preference alignment, and contextual safety. A sympathetic reader would care because applications like large language models and autonomous agents increasingly rely on reinforcement learning for post-training yet face practical barriers in both efficiency and trust. If the contributions succeed, they would enable deployment of reinforcement learning where both computational constraints and alignment demands must be met simultaneously. The work as a whole claims that reinforcement learning supplies the necessary tools for the next generation of such systems.

Core claim

Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges: scaling efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents, and ensuring optimized policies align with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. Together these contributions make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization, and more trustworthy by improving alignment with human preferences and reducing contextually inappropriate information disclosure in language-based intelligent systems.

What carries the argument

The four complementary contributions in federated optimization and preference alignment that advance reinforcement learning along the dimensions of communication-efficient scalability and human-preference alignment.

If this is right

  • Reinforcement learning policies can be optimized efficiently in federated settings with limited communication bandwidth and heterogeneous computation across agents.
  • Optimized policies for large language models can be aligned with human preferences.
  • Intelligent systems can satisfy safety requirements such as privacy-aware information disclosure in language-based agents.
  • Reinforcement learning supplies a single framework that addresses both efficient optimization and trustworthy behavior in next-generation intelligent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to real-time decision systems in robotics or autonomous vehicles where similar federated and alignment methods might ensure both low-latency operation and ethical constraints.
  • If the methods prove robust, they might inform regulatory standards for deploying reinforcement learning in safety-critical distributed networks.
  • A natural next test would be whether these techniques combine with existing post-training methods like RLHF to produce measurable gains in multi-agent scenarios.
  • The work leaves open whether the same unification holds when scaling to very large models or highly dynamic environments beyond the dissertation's focus.

Load-bearing premise

That the four unspecified complementary contributions in federated optimization and preference alignment actually deliver measurable improvements in scalability and safety.

What would settle it

The central claim would be falsified by empirical results showing that the proposed federated algorithms do not reduce communication costs, or by alignment techniques that fail to reduce unsafe disclosures in language-model outputs.

Figures

Figures reproduced from arXiv: 2605.08378 by Guangchen Lan.

Figure 2.1
Figure 2.1. Figure 2.1: An illustration of federated learning based on second-order methods with N agents. (a) FedNPG via standard average. In the uplink, transmitting the matrix H_i brings O(d^2) communication complexity. (b) FedNPG-ADMM in this paper with only O(d) communication complexity. [PITH_FULL_IMAGE:figures/full_fig_p025_2_1.png] view at source ↗
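A back-of-envelope comparison makes the caption's O(d^2)-versus-O(d) uplink gap concrete. The parameter count d and the float width below are arbitrary illustrative choices, not values taken from the dissertation.

```python
# Back-of-envelope uplink payload per agent per round (illustrative values only).
d = 10_000           # hypothetical number of policy parameters
bytes_per_float = 4  # float32

# Standard-average FedNPG: each agent uplinks a d x d matrix H_i -> O(d^2) floats.
uplink_matrix_bytes = d * d * bytes_per_float

# FedNPG-ADMM: each agent uplinks d-dimensional vectors -> O(d) floats.
uplink_vector_bytes = d * bytes_per_float

print(f"O(d^2) uplink: {uplink_matrix_bytes / 1e9:.1f} GB per round")
print(f"O(d)   uplink: {uplink_vector_bytes / 1e6:.2f} MB per round")
```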
Figure 2.2
Figure 2.2. Figure 2.2: Reward performances of standard average FedNPG and FedNPG-ADMM on MuJoCo tasks, where N is the number of federated agents. Top: Swimmer-v4, Bottom: Hopper-v4. Left: FedNPG with O(d^2) communication complexity, Right: FedNPG-ADMM with O(d) communication complexity [PITH_FULL_IMAGE:figures/full_fig_p032_2_2.png] view at source ↗
Figure 2.3
Figure 2.3. Figure 2.3: Comparisons of FedPPO, standard average FedNPG, and FedNPG-ADMM on MuJoCo tasks, where the number of federated agents N is 8 and the communication overhead is measured by the transmitted bytes in each agent. Left: Swimmer-v4 task, Right: Humanoid-v4 task, Top: Reward performances, Bottom: Communication overhead [PITH_FULL_IMAGE:figures/full_fig_p033_2_3.png] view at source ↗
Figure 2.4
Figure 2.4. Figure 2.4: Reward performances of FedNPG-ADMM on the Swimmer-v4 task with agent selection. In each iteration, the server randomly selects 100%, 75%, and 50% of agents for the aggregation. [PITH_FULL_IMAGE:figures/full_fig_p034_2_4.png] view at source ↗
Figure 3.1
Figure 3.1. Figure 3.1: An illustration of the asynchronous federated policy gradient updates. Each agent has a local copy of the environment, and agents may collect data according to different local policies. At each iteration, the agent in the yellow color finishes the local process and then communicates with the server, while the other agents keep sampling and computing local gradients in parallel. In the k-th global iterati… view at source ↗
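A toy simulation of the asynchronous pattern the caption describes may help: the fastest agent reports first with a possibly stale gradient, the server applies it immediately, and only that agent receives the newest parameters while the others keep working in parallel. This sketches the communication pattern only; local_policy_gradient is a stand-in, and the dissertation's actual AFedPG step sizes and delay handling are not reproduced here.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
d, num_agents, num_global_iters = 8, 4, 20
theta = np.zeros(d)                                   # global policy parameters
local_theta = {i: theta.copy() for i in range(num_agents)}

def local_policy_gradient(params, rng):
    """Stand-in for an agent's locally sampled policy-gradient estimate."""
    return -params + rng.normal(scale=0.1, size=params.shape)

# Each agent's next finish time, drawn at random to mimic heterogeneous speeds.
events = [(rng.exponential(), i) for i in range(num_agents)]
heapq.heapify(events)

for _ in range(num_global_iters):
    finish_time, i = heapq.heappop(events)            # fastest agent reports first
    grad = local_policy_gradient(local_theta[i], rng) # computed from a stale copy
    theta = theta + 0.1 * grad                        # server-side global update
    local_theta[i] = theta.copy()                     # fresh model back to agent i
    heapq.heappush(events, (finish_time + rng.exponential(), i))

print("final parameter norm:", round(float(np.linalg.norm(theta)), 4))
```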
Figure 3.2
Figure 3.2. Figure 3.2: Visualization of the four MuJoCo tasks considered in this paper for experiments. Environment: To validate the effectiveness of our approach via experiments, we consider four popular MuJoCo environments for robotic control (Swimmer-v4, Hopper-v4, Walker2D-v4, and Humanoid-v4) [52] with the MIT License. Both the state and action spaces are continuous. Environmental details are described in Table B.1 in App… view at source ↗
Figure 3.3
Figure 3.3. Figure 3.3: Reward performances of AFedPG (N = 2, 4, 8) and PG (N = 1) on various MuJoCo environments, where N is the number of federated agents. The solid lines are averaged results over 10 runs with random seeds from 0 to 9. The shadowed areas are confidence intervals with 95% confidence level. The lines are smoothed for better visualiz… view at source ↗
Figure 3.4
Figure 3.4. Figure 3.4: Global time of AFedPG and FedPG with certain numbers of collected samples on various MuJoCo environments, where N is the number of federated agents. The solid lines are averaged results over 10 runs. The shadowed areas are confidence intervals with 95% confidence level. Baselines: We first consider the conventional PG approach with N = 1, to see the effect of using multiple agents for improving sample co… view at source ↗
Figure 4.1
Figure 4.1. Figure 4.1: An example of (x, yw, yl) pair. Both responses yw and yl have good quality as they achieve high rewards, where r(x, yw) = 0.95, r(x, yl) = 0.91, and r ∈ [0, 1]. Both yw and yl have high rewards, which reflect the high qualities. However, in MLE and its derived DPO, the learning objective is nothing but to increase the gap between yw and yl , regardless of the fact that both of them have high qualities wi… view at source ↗
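For context on the objective the caption criticizes, the snippet below evaluates the standard DPO loss for a single (x, y_w, y_l) pair; it depends only on the gap between the chosen and rejected log-probability ratios, so shifting both responses down by the same amount leaves it unchanged. The numbers are hypothetical, and the dissertation's MaP re-weighting is not reproduced here.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Standard DPO loss for one (x, y_w, y_l) pair: -log sigmoid of the
    scaled difference between chosen and rejected log-probability ratios."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Hypothetical log-probabilities (not values from the paper).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))   # baseline pair
# Downscaling BOTH responses by the same amount leaves the loss unchanged,
# because only the gap between them enters the objective.
print(dpo_loss(-15.0, -17.0, -11.0, -11.0))   # same loss as above
```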
Figure 4.2
Figure 4.2. Figure 4.2: Under the standard MLE-based DPO (left), empirical studies [84–86, 88] demonstrated that training tends to simultaneously downscale (with different magnitudes) both the chosen and rejected responses to increase their gap. Our MaP-based method (right) mitigates this harmful tendency by re-weighting the rejected response based on prior knowledge. Here, the x-axis denotes the initial model θ0 and a potenti… view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: Illustration of the iterative MaPPO pipeline in each iteration k. With a prompt set D, we can equally divide D into K subsets as D1 · · · DK. In the k-th iteration, we first freeze the current policy model πθ, and then get responses (y1, y2) from the policy according to the prompt set Dk. We then use a reward model to get the responses’ corresponding rewards and collect (yw, yl) pairs, which reflect the … view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p074_5.png] view at source ↗
Figure 5.1
Figure 5.1. Figure 5.1: Contextual integrity (CI) violations in agents arise when they fail to recognize the appropriateness of the sharing of background information for a given context. We propose a framework that explicitly reasons about the contextual appropriateness of each user attribute. In this context, the attributes in green are appropriate to share whereas the attributes in red are inappropriate. In this illustration,… view at source ↗
Figure 5.2
Figure 5.2. Figure 5.2: Prompt template for contextual integrity reasoning (seed scenario, domain, transmission principle → vignette actors + CI slots → dataset item {task, info, annotation}). [PITH_FULL_IMAGE:figures/full_fig_p081_5_2.png] view at source ↗
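A minimal sketch of the {task, info, annotation} item named in the caption, assuming a simple per-attribute appropriateness label; the field contents and any structure beyond those three names are illustrative guesses, not the dissertation's schema.

```python
from dataclasses import dataclass

@dataclass
class CIDatasetItem:
    task: str          # what the language agent is asked to do
    info: dict         # background user attributes available for sharing
    annotation: dict   # per-attribute contextual-appropriateness label

item = CIDatasetItem(
    task="Draft an email to the clinic to reschedule next week's appointment.",
    info={"patient_name": "...", "appointment_id": "...", "salary": "..."},
    annotation={"patient_name": "appropriate",
                "appointment_id": "appropriate",
                "salary": "inappropriate"},
)
print(item.annotation["salary"])   # -> inappropriate
```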
read the original abstract

Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. This dissertation argues that reinforcement learning provides a unifying framework for next-generation intelligent systems by addressing two challenges: efficient scaling in distributed federated environments with limited communication and heterogeneous computation, and trustworthiness in post-training of large language models and autonomous agents via human preference alignment and safety constraints such as privacy-aware disclosure. It structures the work into two parts—scalable federated RL through communication-efficient and asynchronous optimization, and trustworthy RL for LLMs—with four complementary contributions spanning federated optimization, preference alignment, and contextual safety.

Significance. If the four contributions deliver measurable gains in scalability and safety, the work could help unify efficiency and alignment research in RL applications to modern AI. The high-level framing is conceptually coherent with ongoing trends in federated learning and LLM post-training, but the manuscript offers no methods, experiments, quantitative results, or error analysis to substantiate the claims, leaving the significance prospective rather than established.

major comments (1)
  1. [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that the abstract requires revision to better substantiate its claims with references to the specific methods and results from the four contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.

    Authors: We acknowledge the referee's observation that the submitted abstract summarizes the contributions at a high level without including the supporting technical details. The full dissertation contains four specific contributions, each with algorithms (including communication-efficient and asynchronous federated RL methods), derivations, experimental protocols, datasets, metrics, and quantitative results demonstrating scalability gains and improved alignment/safety. To address this directly, we will revise the abstract to name the four contributions explicitly and briefly reference their key methods and empirical outcomes, thereby grounding the unifying framework claim. We will also update the overall structure description to cross-reference the detailed chapters. revision: yes

Circularity Check

0 steps flagged

No circularity: high-level thesis with no derivations or equations

full rationale

The manuscript is a dissertation abstract and high-level position statement asserting that reinforcement learning supplies a unifying framework for scalable federated optimization and trustworthy preference alignment/safety. No equations, first-principles derivations, fitted parameters, or quantitative predictions appear in the provided text. The central claim is an organizational assertion about complementary contributions rather than a mathematical reduction; it does not define any quantity in terms of itself, rename a known result, or rely on self-citation chains for load-bearing steps. The argument is therefore self-contained at the level of scope description and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical content is present; the abstract introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5502 in / 925 out tokens · 37717 ms · 2026-05-12T01:43:14.819874+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

258 extracted references · 258 canonical work pages · 23 internal anchors

  1. [1]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning,

    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Machine learning, vol. 8, pp. 229–256, 1992

  2. [2]

    DeepPool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,

    A. O. Al-Abbasi, A. Ghosh, and V. Aggarwal, “DeepPool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,”IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 12, pp. 4714–4727, 2019

  3. [3]

    A reinforcement learning framework for vehicular network routing under peak and average constraints,

    N. Geng et al., “A reinforcement learning framework for vehicular network routing under peak and average constraints,”IEEE Transactions on Vehicular Technology, vol. 72, no. 5, pp. 6753–6764, 2023

  4. [4]

    Two-tiered online optimization of region-wide datacenter resource allocation via deep reinforcement learning,

    C.-L. Chen et al., “Two-tiered online optimization of region-wide datacenter resource allocation via deep reinforcement learning,”arXiv preprint arXiv:2306.17054, 2023

  5. [5]

    ASAP: A semi-autonomous precise system for telesurgery during communication delays,

    G. Gonzalez et al., “ASAP: A semi-autonomous precise system for telesurgery during communication delays,”IEEE Transactions on Medical Robotics and Bionics, vol. 5, no. 1, pp. 66–78, 2023

  6. [6]

    Liquid-augmented MPC in quadrupedal robot for disturbance learning,

    Y. Mao, Y. Zhang, and L. Gao, “Liquid-augmented MPC in quadrupedal robot for disturbance learning,”Electronics, vol. 14, no. 24, p. 4843, 2025

  7. [7]

    Invertible liquid neural network-based learning of inverse kinematics and dynamics for robotic manipulators,

    Y. Zhang et al., “Invertible liquid neural network-based learning of inverse kinematics and dynamics for robotic manipulators,”Scientific Reports, vol. 15, no. 1, p. 42311, 2025

  8. [8]

    Training language models to follow instructions with human feedback,

    L. Ouyang et al., “Training language models to follow instructions with human feedback,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: Curran Associates, Inc., Nov. 2022, pp. 27730–27744

  9. [9]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  10. [10]

    Gemma: Open Models Based on Gemini Research and Technology

    G. Team and G. DeepMind, “Gemma: Open models based on Gemini research and technology,”arXiv preprint arXiv:2403.08295, 2024

  11. [11]

    Data science and its relationship to big data and data-driven decision making,

    F. Provost and T. Fawcett, “Data science and its relationship to big data and data-driven decision making,”Big Data, vol. 1, no. 1, pp. 51–59, 2013

  12. [12]

    Distributed learning in wireless sensor networks,

    J. B. Predd, S. B. Kulkarni, and H. V. Poor, “Distributed learning in wireless sensor networks,”IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 56–69, 2006

  13. [13]

    Federated learning for wireless communications: Motivation, opportunities, and challenges,

    S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, 2020

  14. [14]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,”Advances in Neural Information processing Systems (NeurIPS), vol. 30, 2017

  15. [15]

    Privacy in context: Technology, policy, and the integrity of social life

    H. Nissenbaum, Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press, 2009

  16. [16]

    A natural policy gradient,

    S. M. Kakade, “A natural policy gradient,” in Advances in Neural Information Processing Systems (NIPS), vol. 14, Vancouver, Canada: MIT Press, 2001

  17. [17]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  18. [18]

    Trust region policy optimization,

    J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,”arXiv preprint arXiv:1502.05477, 2017

  19. [19]

    Communication-efficient learning of deep networks from decentralized data,

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 54, Ft. Lauderdale, FL, USA: PMLR, Apr. 2017, pp. 1273–1282

  20. [20]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 53728–53741

  21. [21]

    Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory,

    N. Mireshghallah et al., “Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory,” in The Twelfth International Conference on Learning Representations (ICLR), 2024

  22. [22]

    PrivacyLens: Evaluating privacy norm awareness of language models in action,

    Y. Shao, T. Li, W. Shi, Y. Liu, and D. Yang, “PrivacyLens: Evaluating privacy norm awareness of language models in action,” inThe Thirty-eight Conference on Neural Information Processing (NeurIPS) Systems Datasets and Benchmarks Track, 2024

  23. [23]

    Communication-efficient federated learning for resource-constrained edge devices,

    G. Lan, X.-Y. Liu, Y. Zhang, and X. Wang, “Communication-efficient federated learning for resource-constrained edge devices,”IEEE Transactions on Machine Learning in Communications and Networking, vol. 1, pp. 210–224, 2023

  24. [24]

    Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,

    G. Lan, H. Wang, J. Anderson, C. Brinton, and V. Aggarwal, “Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024

  25. [25]

    Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis,

    G. Lan, D.-J. Han, A. Hashemi, V. Aggarwal, and C. Brinton, “Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis,” in The Thirteenth International Conference on Learning Representations (ICLR), 2025

  26. [26]

    Privacy and contextual integrity: Framework and applications,

    A. Barth, A. Datta, J. C. Mitchell, and H. Nissenbaum, “Privacy and contextual integrity: Framework and applications,” inIEEE symposium on security and privacy (S&P), 2006

  27. [27]

    Contextual integrity in LLMs via reasoning and reinforcement learning,

    G. Lan et al., “Contextual integrity in LLMs via reasoning and reinforcement learning,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  28. [28]

    FedNew: A communication-efficient and privacy-preserving Newton-type method for federated learning,

    A. Elgabli, C. B. Issaid, A. S. Bedi, K. Rajawat, M. Bennis, and V. Aggarwal, “FedNew: A communication-efficient and privacy-preserving Newton-type method for federated learning,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 162, Baltimore, MD, USA: PMLR, Jul. 2022, pp. 5861–5877

  29. [29]

    Communication-efficient distributed optimization using an approximate Newton-type method,

    O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 32, Bejing, China: PMLR, Jun. 2014, pp. 1000–1008

  30. [30]

    FedNL: Making Newton-type methods applicable to federated learning,

    M. Safaryan, R. Islamov, X. Qian, and P. Richtarik, “FedNL: Making Newton-type methods applicable to federated learning,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 162, Baltimore, MD, USA: PMLR, Jul. 2022, pp. 18959–19010

  31. [31]

    Basis matters: Better communication-efficient second order methods for federated learning,

    X. Qian, R. Islamov, M. Safaryan, and P. Richtarik, “Basis matters: Better communication-efficient second order methods for federated learning,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 680–720

  32. [32]

    Distributed second order methods with fast rates and compressed communication,

    R. Islamov, X. Qian, and P. Richtarik, “Distributed second order methods with fast rates and compressed communication,” inProceedings of the 38th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 139, PMLR, Jul. 2021, pp. 4617–4628

  33. [33]

    Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,

    G. Lan, H. Wang, J. Anderson, C. Brinton, and V. Aggarwal, “Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,” inThirty-seventh Conference on Neural Information Processing Systems (NeurIPS), vol. 36, New Orleans, LA, USA: Curran Associates, Inc., Dec. 2023, pp. 59873–59885

  34. [34]

    GADMM: Fast and communication efficient framework for distributed machine learning,

    A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, “GADMM: Fast and communication efficient framework for distributed machine learning,” Journal of Machine Learning Research, vol. 21, no. 76, pp. 1–39, 2020

  35. [35]

    Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning,

    A. Elgabli, J. Park, A. S. Bedi, C. B. Issaid, M. Bennis, and V. Aggarwal, “Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning,” IEEE Transactions on Communications, vol. 69, no. 1, pp. 164–181, 2020

  36. [36]

    FedADMM: A federated primal-dual algorithm allowing partial participation,

    H. Wang, S. Marella, and J. Anderson, “FedADMM: A federated primal-dual algorithm allowing partial participation,” inIEEE 61st Conference on Decision and Control (CDC), Cancun, Mexico, 2022

  37. [37]

    Federated deep reinforcement learning,

    H. H. Zhuo, W. Feng, Y. Lin, Q. Xu, and Q. Yang, “Federated deep reinforcement learning,”arXiv preprint arXiv:1901.08277, 2020

  38. [38]

    Federated reinforcement learning: Techniques, applications, and open challenges,

    J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,”arXiv preprint arXiv:2108.11887, 2021

  39. [39]

    Federated reinforcement learning: Linear speedup under Markovian sampling,

    S. Khodadadian, P. Sharma, G. Joshi, and S. T. Maguluri, “Federated reinforcement learning: Linear speedup under Markovian sampling,” inInternational Conference on Machine Learning (ICML), vol. 162, Baltimore, MD, USA: PMLR, 2022, pp. 10997– 11057

  40. [40]

    Federated reinforcement learning with environment heterogeneity,

    H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated reinforcement learning with environment heterogeneity,” inInternational Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 18–37

  41. [41]

    Federated TD learning with linear function approximation under environmental heterogeneity,

    H. Wang, A. Mitra, H. Hassani, G. J. Pappas, and J. Anderson, “Federated TD learning with linear function approximation under environmental heterogeneity,”Transactions on Machine Learning Research, 2024,issn: 2835-8856

  42. [42]

    FedKL: Tackling data heterogeneity in federated reinforcement learning by penalizing KL divergence,

    Z. Xie and S. Song, “FedKL: Tackling data heterogeneity in federated reinforcement learning by penalizing KL divergence,”IEEE Journal on Selected Areas in Communi- cations, vol. 41, no. 4, pp. 1227–1242, 2023

  43. [43]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional contin- uous control using generalized advantage estimation,”arXiv preprint arXiv:1506.02438, 2018

  44. [44]

    On the theory of policy gradient methods: Optimality, approximation, and distribution shift,

    A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,”Journal of Machine Learning Research, vol. 22, no. 1, pp. 4431–4506, 2021

  45. [45]

    Minibatch vs local SGD for heterogeneous distributed learning,

    B. E. Woodworth, K. K. Patel, and N. Srebro, “Minibatch vs local SGD for heterogeneous distributed learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 6281–6292

  46. [46]

    Speedtest market report in the United States

    speedtest.net, Speedtest market report in the United States, 2023. [Online]. Available: http://www.speedtest.net/reports/united-states

  47. [47]

    Distributed optimization and statistical learning via the alternating direction method of multipliers,

    S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011

  48. [48]

    Natural policy gradient primal-dual method for constrained Markov decision processes,

    D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 8378–8390

  49. [49]

    An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods,

    Y. Liu, K. Zhang, T. Basar, and W. Yin, “An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 7624–7636

  50. [50]

    Stochastic variance-reduced policy gradient,

    M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” in Proceedings of the 35th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80, Stockholm, Sweden: PMLR, Jul. 2018, pp. 4026–4035

  51. [51]

    Sample efficient policy gradient methods with recursive variance reduction,

    P. Xu, F. Gao, and Q. Gu, “Sample efficient policy gradient methods with recursive variance reduction,” inInternational Conference on Learning Representations (ICLR), 2020

  52. [52]

    MuJoCo: A physics engine for model-based control,

    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” inIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 2012, pp. 5026–5033

  53. [53]

    Stable-Baselines3: Reliable reinforcement learning implementations,

    A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021

  54. [54]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014

  55. [55]

    PyTorch: An imperative style, high-performance deep learning library,

    A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, Vancouver, Canada: Curran Associates, Inc., Dec. 2019

  56. [56]

    Personalized federated reinforcement learning with shared representations,

    G. Xiong, S. Wang, D. Jiang, and J. Li, “Personalized federated reinforcement learning with shared representations,” in Deployable RL: From Research to Practice @ Reinforcement Learning Conference, 2024

  57. [57]

    Asynchronous federated optimization,

    C. Xie, S. Koyejo, and I. Gupta, “Asynchronous federated optimization,” in12th Annual Workshop on Optimization for Machine Learning (OPT), 2020

  58. [58]

    VAFL: A method of vertical asynchronous federated learning,

    T. Chen, X. Jin, Y. Sun, and W. Yin, “VAFL: A method of vertical asynchronous federated learning,”arXiv preprint arXiv:2007.06081, 2020

  59. [59]

    Single-forking of coded subtasks for straggler mitigation,

    A. Badita, P. Parag, and V. Aggarwal, “Single-forking of coded subtasks for straggler mitigation,” IEEE/ACM Transactions on Networking, vol. 29, no. 6, pp. 2413–2424, 2021

  60. [60]

    Asynchronous SGD beats minibatch SGD under arbitrary delays,

    K. Mishchenko, F. Bach, M. Even, and B. E. Woodworth, “Asynchronous SGD beats minibatch SGD under arbitrary delays,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: PMLR, Nov. 2022, pp. 420–433

  61. [61]

    Asynchronous multi-model dynamic federated learning over wireless networks: Theory, modeling, and optimization,

    Z.-L. Chang, S. Hosseinalipour, M. Chiang, and C. G. Brinton, “Asynchronous multi-model dynamic federated learning over wireless networks: Theory, modeling, and optimization,” IEEE Transactions on Cognitive Communications and Networking, vol. 10, no. 5, pp. 1989–2004, 2024

  62. [62]

    Sharper convergence guarantees for asynchronous SGD for distributed and federated learning,

    A. Koloskova, S. U. Stich, and M. Jaggi, “Sharper convergence guarantees for asynchronous SGD for distributed and federated learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: PMLR, Nov. 2022, pp. 17202–17215

  63. [63]

    A general sample complexity analysis of vanilla policy gradient,

    R. Yuan, R. M. Gower, and A. Lazaric, “A general sample complexity analysis of vanilla policy gradient,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 3332–3380

  64. [64]

    Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies,

    I. Fatkhullin, A. Barakat, A. Kireeva, and N. He, “Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies,” inInternational Con- ference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 202, Honolulu, HI, USA: PMLR, Jul. 2023, pp. 9827–9869

  65. [65]

    Momentum-based policy gradient methods,

    F. Huang, S. Gao, J. Pei, and H. Huang, “Momentum-based policy gradient methods,” inInternational conference on machine learning (ICML), ser. Proceedings of Machine Learning Research, vol. 119, PMLR, Jul. 2020, pp. 4422–4433

  66. [66]

    On the global optimum convergence of momentum- based policy gradient,

    Y. Ding, J. Zhang, and J. Lavaei, “On the global optimum convergence of momentum- based policy gradient,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 1910–1934

  67. [67]

    Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes,

    W. U. Mondal and V. Aggarwal, “Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes,” inInternational Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2024, pp. 3097–3105

  68. [68]

    Efficient and light-weight federated learning via asynchronous distributed dropout,

    C. Dun, M. Hipolito, C. Jermaine, D. Dimitriadis, and A. Kyrillidis, “Efficient and light-weight federated learning via asynchronous distributed dropout,” inProceedings of The 26th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 206, Palau de Congressos, Valencia, Spain: PML...

  69. [69]

    Federated Q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost,

    Z. Zheng, H. Zhang, and L. Xue, “Federated Q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost,”arXiv preprint arXiv:2405.18795, 2024

  70. [70]

    The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond,

    J. Woo, G. Joshi, and Y. Chi, “The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond,” inInternational Conference on Machine Learning (ICML), PMLR, 2023, pp. 37157–37216

  71. [71]

    The sample-communication complexity trade-off in federated Q-learning,

    S. Salgia and Y. Chi, “The sample-communication complexity trade-off in federated Q-learning,”arXiv preprint arXiv:2408.16981, 2024

  72. [72]

    Finite-time analysis of on-policy heterogeneous federated reinforcement learning,

    C. Zhang, H. Wang, A. Mitra, and J. Anderson, “Finite-time analysis of on-policy heterogeneous federated reinforcement learning,” in The Twelfth International Conference on Learning Representations (ICLR), 2024

  73. [73]

    Asynchronous methods for deep reinforcement learning,

    V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” inProceedings of The 33rd International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 48, New York, NY, USA: PMLR, Jun. 2016, pp. 1928– 1937

  74. [74]

    Federated natural policy gradient and actor critic methods for multi-task reinforcement learning,

    T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi, “Federated natural policy gradient and actor critic methods for multi-task reinforcement learning,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

  75. [75]

    Communication-efficient policy gradient methods for distributed reinforcement learning,

    T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient policy gradient methods for distributed reinforcement learning,”IEEE Transactions on Control of Network Systems, vol. 9, no. 2, pp. 917–929, 2021

  76. [76]

    Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments,

    H. Wang, S. He, Z. Zhang, F. Miao, and J. Anderson, “Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments,” in International Conference on Machine Learning (ICML), 2024

  77. [77]

    Global convergence guarantees for federated policy gradient methods with adversaries,

    S. Ganesh, J. Chen, G. Thoppe, and V. Aggarwal, “Global convergence guarantees for federated policy gradient methods with adversaries,”Transactions on Machine Learning Research, 2024,issn: 2835-8856

  78. [78]

    Learning to summarize with human feedback,

    N. Stiennon et al., “Learning to summarize with human feedback,”Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 3008–3021, 2020

  79. [79]

    Open problems and fundamental limitations of reinforcement learning from human feedback,

    S. Casper et al., “Open problems and fundamental limitations of reinforcement learning from human feedback,”arXiv preprint arXiv:2307.15217, 2023

  80. [80]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    J. Dai et al., “Safe RLHF: Safe reinforcement learning from human feedback,” arXiv preprint arXiv:2310.12773, 2023

Showing first 80 references.