pith. machine review for the scientific record.

arxiv: 2605.08378 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

Guangchen Lan

Pith reviewed 2026-05-12 01:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords learning · reinforcement · dissertation · intelligent systems · trustworthy · federated · optimization

The pith

Reinforcement learning provides a unifying framework for both scalable optimization in distributed settings and trustworthy behavior aligned with human preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The dissertation identifies two central challenges for reinforcement learning in intelligent systems: efficient scaling in federated environments with limited communication and heterogeneous computation, and ensuring policies align with human preferences while meeting safety requirements such as privacy-aware disclosure. It addresses these through four complementary contributions in federated optimization, preference alignment, and contextual safety. A sympathetic reader would care because applications like large language models and autonomous agents increasingly rely on reinforcement learning for post-training yet face practical barriers in both efficiency and trust. If the contributions succeed, they would enable deployment of reinforcement learning where both computational constraints and alignment demands must be met simultaneously. The work as a whole claims that reinforcement learning supplies the necessary tools for the next generation of such systems.

Core claim

Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges: scaling efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents, and ensuring optimized policies align with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. Together these contributions make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization, and more trustworthy by improving alignment with human preferences and reducing contextually inappropriate information disclosure in language-based intelligent systems.

What carries the argument

The four complementary contributions in federated optimization and preference alignment that advance reinforcement learning along the dimensions of communication-efficient scalability and human-preference alignment.

If this is right

  • Reinforcement learning policies can be optimized efficiently in federated settings with limited communication bandwidth and heterogeneous computation across agents.
  • Optimized policies for large language models can be aligned with human preferences.
  • Intelligent systems can satisfy safety requirements such as privacy-aware information disclosure in language-based agents.
  • Reinforcement learning supplies a single framework that addresses both efficient optimization and trustworthy behavior in next-generation intelligent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to real-time decision systems in robotics or autonomous vehicles where similar federated and alignment methods might ensure both low-latency operation and ethical constraints.
  • If the methods prove robust, they might inform regulatory standards for deploying reinforcement learning in safety-critical distributed networks.
  • A natural next test would be whether these techniques combine with existing post-training methods like RLHF to produce measurable gains in multi-agent scenarios.
  • The work leaves open whether the same unification holds when scaling to very large models or highly dynamic environments beyond the dissertation's focus.

Load-bearing premise

That the four unspecified complementary contributions in federated optimization and preference alignment actually deliver measurable improvements in scalability and safety.

What would settle it

The central claim would be falsified by empirical results showing that the proposed federated algorithms do not reduce communication costs, or by alignment techniques that fail to reduce unsafe disclosures in language-model outputs.

Figures

Figures reproduced from arXiv: 2605.08378 by Guangchen Lan.

Figure 2.1
Figure 2.1. Figure 2.1: An illustration of federated learning based on second-order methods with N agents. (a) FedNPG via standard average. In the uplink, transmitting the matrix H_i brings O(d^2) communication complexity. (b) FedNPG-ADMM in this paper with only O(d) communication complexity. [PITH_FULL_IMAGE:figures/full_fig_p025_2_1.png] view at source ↗
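A back-of-envelope comparison makes the caption's O(d^2)-versus-O(d) uplink gap concrete. The parameter count d and the float width below are arbitrary illustrative choices, not values taken from the dissertation.

```python
# Back-of-envelope uplink payload per agent per round (illustrative values only).
d = 10_000           # hypothetical number of policy parameters
bytes_per_float = 4  # float32

# Standard-average FedNPG: each agent uplinks a d x d matrix H_i -> O(d^2) floats.
uplink_matrix_bytes = d * d * bytes_per_float

# FedNPG-ADMM: each agent uplinks d-dimensional vectors -> O(d) floats.
uplink_vector_bytes = d * bytes_per_float

print(f"O(d^2) uplink: {uplink_matrix_bytes / 1e9:.1f} GB per round")
print(f"O(d)   uplink: {uplink_vector_bytes / 1e6:.2f} MB per round")
```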
Figure 2.2
Figure 2.2. Figure 2.2: Reward performances of standard average FedNPG and FedNPG-ADMM on MuJoCo tasks, where N is the number of federated agents. Top: Swimmer-v4, Bottom: Hopper-v4. Left: FedNPG with O(d^2) communication complexity, Right: FedNPG-ADMM with O(d) communication complexity [PITH_FULL_IMAGE:figures/full_fig_p032_2_2.png] view at source ↗
Figure 2.3
Figure 2.3. Figure 2.3: Comparisons of FedPPO, standard average FedNPG, and FedNPG-ADMM on MuJoCo tasks, where the number of federated agents N is 8 and the communication overhead is measured by the transmitted bytes in each agent. Left: Swimmer-v4 task, Right: Humanoid-v4 task, Top: Reward performances, Bottom: Communication overhead [PITH_FULL_IMAGE:figures/full_fig_p033_2_3.png] view at source ↗
Figure 2.4
Figure 2.4. Figure 2.4: Reward performances of FedNPG-ADMM on the Swimmer-v4 task with agent selection. In each iteration, the server randomly selects 100%, 75%, and 50% of agents for the aggregation. [PITH_FULL_IMAGE:figures/full_fig_p034_2_4.png] view at source ↗
Figure 3.1
Figure 3.1. Figure 3.1: An illustration of the asynchronous federated policy gradient updates. Each agent has a local copy of the environment, and agents may collect data according to different local policies. At each iteration, the agent in the yellow color finishes the local process and then communicates with the server, while the other agents keep sampling and computing local gradients in parallel. In the k-th global iterati… view at source ↗
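A toy simulation of the asynchronous pattern the caption describes may help: the fastest agent reports first with a possibly stale gradient, the server applies it immediately, and only that agent receives the newest parameters while the others keep working in parallel. This sketches the communication pattern only; local_policy_gradient is a stand-in, and the dissertation's actual AFedPG step sizes and delay handling are not reproduced here.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
d, num_agents, num_global_iters = 8, 4, 20
theta = np.zeros(d)                                   # global policy parameters
local_theta = {i: theta.copy() for i in range(num_agents)}

def local_policy_gradient(params, rng):
    """Stand-in for an agent's locally sampled policy-gradient estimate."""
    return -params + rng.normal(scale=0.1, size=params.shape)

# Each agent's next finish time, drawn at random to mimic heterogeneous speeds.
events = [(rng.exponential(), i) for i in range(num_agents)]
heapq.heapify(events)

for _ in range(num_global_iters):
    finish_time, i = heapq.heappop(events)            # fastest agent reports first
    grad = local_policy_gradient(local_theta[i], rng) # computed from a stale copy
    theta = theta + 0.1 * grad                        # server-side global update
    local_theta[i] = theta.copy()                     # fresh model back to agent i
    heapq.heappush(events, (finish_time + rng.exponential(), i))

print("final parameter norm:", round(float(np.linalg.norm(theta)), 4))
```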
Figure 3.2
Figure 3.2. Figure 3.2: Visualization of the four MuJoCo tasks considered in this paper for experiments. Environment: To validate the effectiveness of our approach via experiments, we consider four popular MuJoCo environments for robotic control (Swimmer-v4, Hopper-v4, Walker2D-v4, and Humanoid-v4) [52] with the MIT License. Both the state and action spaces are continuous. Environmental details are described in Table B.1 in App… view at source ↗
Figure 3.3
Figure 3.3. Figure 3.3: Reward performances of AFedPG (N = 2, 4, 8) and PG (N = 1) on various MuJoCo environments, where N is the number of federated agents. The solid lines are averaged results over 10 runs with random seeds from 0 to 9. The shadowed areas are confidence intervals with 95% confidence level. The lines are smoothed for better visualiz… view at source ↗
Figure 3.4
Figure 3.4. Figure 3.4: Global time of AFedPG and FedPG with certain numbers of collected samples on various MuJoCo environments, where N is the number of federated agents. The solid lines are averaged results over 10 runs. The shadowed areas are confidence intervals with 95% confidence level. Baselines: We first consider the conventional PG approach with N = 1, to see the effect of using multiple agents for improving sample co… view at source ↗
Figure 4.1
Figure 4.1. Figure 4.1: An example of (x, yw, yl) pair. Both responses yw and yl have good quality as they achieve high rewards, where r(x, yw) = 0.95, r(x, yl) = 0.91, and r ∈ [0, 1]. Both yw and yl have high rewards, which reflect the high qualities. However, in MLE and its derived DPO, the learning objective is nothing but to increase the gap between yw and yl , regardless of the fact that both of them have high qualities wi… view at source ↗
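For context on the objective the caption criticizes, the snippet below evaluates the standard DPO loss for a single (x, y_w, y_l) pair; it depends only on the gap between the chosen and rejected log-probability ratios, so shifting both responses down by the same amount leaves it unchanged. The numbers are hypothetical, and the dissertation's MaP re-weighting is not reproduced here.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Standard DPO loss for one (x, y_w, y_l) pair: -log sigmoid of the
    scaled difference between chosen and rejected log-probability ratios."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Hypothetical log-probabilities (not values from the paper).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))   # baseline pair
# Downscaling BOTH responses by the same amount leaves the loss unchanged,
# because only the gap between them enters the objective.
print(dpo_loss(-15.0, -17.0, -11.0, -11.0))   # same loss as above
```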
Figure 4.2
Figure 4.2. Figure 4.2: Under the standard MLE-based DPO (left), empirical studies [84–86, 88] demonstrated that training tends to simultaneously downscale (with different magnitudes) both the chosen and rejected responses to increase their gap. Our MaP-based method (right) mitigates this harmful tendency by re-weighting the rejected response based on prior knowledge. Here, the x-axis denotes the initial model θ0 and a potenti… view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: Illustration of the iterative MaPPO pipeline in each iteration k. With a prompt set D, we can equally divide D into K subsets as D1 · · · DK. In the k-th iteration, we first freeze the current policy model πθ, and then get responses (y1, y2) from the policy according to the prompt set Dk. We then use a reward model to get the responses’ corresponding rewards and collect (yw, yl) pairs, which reflect the … view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p074_5.png] view at source ↗
Figure 5.1
Figure 5.1. Figure 5.1: Contextual integrity (CI) violations in agents arise when they fail to recognize the appropriateness of the sharing of background information for a given context. We propose a framework that explicitly reasons about the contextual appropriateness of each user attribute. In this context, the attributes in green are appropriate to share whereas the attributes in red are inappropriate. In this illustration,… view at source ↗
Figure 5.2
Figure 5.2. Figure 5.2: Prompt template for contextual integrity reasoning (seed scenario, domain, transmission principle → vignette actors + CI slots → dataset item {task, info, annotation}). [PITH_FULL_IMAGE:figures/full_fig_p081_5_2.png] view at source ↗
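A minimal sketch of the {task, info, annotation} item named in the caption, assuming a simple per-attribute appropriateness label; the field contents and any structure beyond those three names are illustrative guesses, not the dissertation's schema.

```python
from dataclasses import dataclass

@dataclass
class CIDatasetItem:
    task: str          # what the language agent is asked to do
    info: dict         # background user attributes available for sharing
    annotation: dict   # per-attribute contextual-appropriateness label

item = CIDatasetItem(
    task="Draft an email to the clinic to reschedule next week's appointment.",
    info={"patient_name": "...", "appointment_id": "...", "salary": "..."},
    annotation={"patient_name": "appropriate",
                "appointment_id": "appropriate",
                "salary": "inappropriate"},
)
print(item.annotation["salary"])   # -> inappropriate
```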
read the original abstract

Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. This dissertation argues that reinforcement learning provides a unifying framework for next-generation intelligent systems by addressing two challenges: efficient scaling in distributed federated environments with limited communication and heterogeneous computation, and trustworthiness in post-training of large language models and autonomous agents via human preference alignment and safety constraints such as privacy-aware disclosure. It structures the work into two parts—scalable federated RL through communication-efficient and asynchronous optimization, and trustworthy RL for LLMs—with four complementary contributions spanning federated optimization, preference alignment, and contextual safety.

Significance. If the four contributions deliver measurable gains in scalability and safety, the work could help unify efficiency and alignment research in RL applications to modern AI. The high-level framing is conceptually coherent with ongoing trends in federated learning and LLM post-training, but the manuscript offers no methods, experiments, quantitative results, or error analysis to substantiate the claims, leaving the significance prospective rather than established.

major comments (1)
  1. [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that the abstract requires revision to better substantiate its claims with references to the specific methods and results from the four contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.

    Authors: We acknowledge the referee's observation that the submitted abstract summarizes the contributions at a high level without including the supporting technical details. The full dissertation contains four specific contributions, each with algorithms (including communication-efficient and asynchronous federated RL methods), derivations, experimental protocols, datasets, metrics, and quantitative results demonstrating scalability gains and improved alignment/safety. To address this directly, we will revise the abstract to name the four contributions explicitly and briefly reference their key methods and empirical outcomes, thereby grounding the unifying framework claim. We will also update the overall structure description to cross-reference the detailed chapters. revision: yes

Circularity Check

0 steps flagged

No circularity: high-level thesis with no derivations or equations

full rationale

The manuscript is a dissertation abstract and high-level position statement asserting that reinforcement learning supplies a unifying framework for scalable federated optimization and trustworthy preference alignment/safety. No equations, first-principles derivations, fitted parameters, or quantitative predictions appear in the provided text. The central claim is an organizational assertion about complementary contributions rather than a mathematical reduction; it does not define any quantity in terms of itself, rename a known result, or rely on self-citation chains for load-bearing steps. The argument is therefore self-contained at the level of scope description and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical content is present; the abstract introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5502 in / 925 out tokens · 37717 ms · 2026-05-12T01:43:14.819874+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

258 extracted references · 258 canonical work pages · 23 internal anchors

  1. [1]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning,

    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Machine learning, vol. 8, pp. 229–256, 1992

  2. [2]

    DeepPool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,

    A. O. Al-Abbasi, A. Ghosh, and V. Aggarwal, “DeepPool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning,”IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 12, pp. 4714–4727, 2019

  3. [3]

    A reinforcement learning framework for vehicular network routing under peak and average constraints,

    N. Geng et al., “A reinforcement learning framework for vehicular network routing under peak and average constraints,”IEEE Transactions on Vehicular Technology, vol. 72, no. 5, pp. 6753–6764, 2023

  4. [4]

    Two-tiered online optimization of region-wide datacenter resource allocation via deep reinforcement learning,

    C.-L. Chen et al., “Two-tiered online optimization of region-wide datacenter resource allocation via deep reinforcement learning,”arXiv preprint arXiv:2306.17054, 2023

  5. [5]

    ASAP: A semi-autonomous precise system for telesurgery during communication delays,

    G. Gonzalez et al., “ASAP: A semi-autonomous precise system for telesurgery during communication delays,”IEEE Transactions on Medical Robotics and Bionics, vol. 5, no. 1, pp. 66–78, 2023

  6. [6]

    Liquid-augmented MPC in quadrupedal robot for disturbance learning,

    Y. Mao, Y. Zhang, and L. Gao, “Liquid-augmented MPC in quadrupedal robot for disturbance learning,”Electronics, vol. 14, no. 24, p. 4843, 2025

  7. [7]

    Invertible liquid neural network-based learning of inverse kinematics and dynamics for robotic manipulators,

    Y. Zhang et al., “Invertible liquid neural network-based learning of inverse kinematics and dynamics for robotic manipulators,”Scientific Reports, vol. 15, no. 1, p. 42311, 2025

  8. [8]

    Training language models to follow instructions with human feedback,

    L. Ouyang et al., “Training language models to follow instructions with human feedback,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: Curran Associates, Inc., Nov. 2022, pp. 27730–27744

  9. [9]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  10. [10]

    Gemma: Open Models Based on Gemini Research and Technology

    G. Team and G. DeepMind, “Gemma: Open models based on Gemini research and technology,”arXiv preprint arXiv:2403.08295, 2024

  11. [11]

    Data science and its relationship to big data and data-driven decision making,

    F. Provost and T. Fawcett, “Data science and its relationship to big data and data-driven decision making,”Big Data, vol. 1, no. 1, pp. 51–59, 2013

  12. [12]

    Distributed learning in wireless sensor networks,

    J. B. Predd, S. B. Kulkarni, and H. V. Poor, “Distributed learning in wireless sensor networks,”IEEE Signal Processing Magazine, vol. 23, no. 4, pp. 56–69, 2006

  13. [13]

    Federated learning for wireless communications: Motivation, opportunities, and challenges,

    S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, 2020

  14. [14]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,”Advances in Neural Information processing Systems (NeurIPS), vol. 30, 2017

  15. [15]

    Privacy in context: Technology, policy, and the integrity of social life

    H. Nissenbaum, Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press, 2009

  16. [16]

    A natural policy gradient,

    S. M. Kakade, “A natural policy gradient,” in Advances in Neural Information Processing Systems (NIPS), vol. 14, Vancouver, Canada: MIT Press, 2001

  17. [17]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  18. [18]

    Trust region policy optimization,

    J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,”arXiv preprint arXiv:1502.05477, 2017

  19. [19]

    Communication-efficient learning of deep networks from decentralized data,

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 54, Ft. Lauderdale, FL, USA: PMLR, Apr. 2017, pp. 1273–1282

  20. [20]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023, pp. 53728–53741

  21. [21]

    Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory,

    N. Mireshghallah et al., “Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory,” in The Twelfth International Conference on Learning Representations (ICLR), 2024

  22. [22]

    PrivacyLens: Evaluating privacy norm awareness of language models in action,

    Y. Shao, T. Li, W. Shi, Y. Liu, and D. Yang, “PrivacyLens: Evaluating privacy norm awareness of language models in action,” inThe Thirty-eight Conference on Neural Information Processing (NeurIPS) Systems Datasets and Benchmarks Track, 2024

  23. [23]

    Communication-efficient federated learning for resource-constrained edge devices,

    G. Lan, X.-Y. Liu, Y. Zhang, and X. Wang, “Communication-efficient federated learning for resource-constrained edge devices,”IEEE Transactions on Machine Learning in Communications and Networking, vol. 1, pp. 210–224, 2023

  24. [24]

    Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,

    G. Lan, H. Wang, J. Anderson, C. Brinton, and V. Aggarwal, “Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2024

  25. [25]

    Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis,

    G. Lan, D.-J. Han, A. Hashemi, V. Aggarwal, and C. Brinton, “Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis,” in The Thirteenth International Conference on Learning Representations (ICLR), 2025

  26. [26]

    Privacy and contextual integrity: Framework and applications,

    A. Barth, A. Datta, J. C. Mitchell, and H. Nissenbaum, “Privacy and contextual integrity: Framework and applications,” inIEEE symposium on security and privacy (S&P), 2006

  27. [27]

    Contextual integrity in LLMs via reasoning and reinforcement learning,

    G. Lan et al., “Contextual integrity in LLMs via reasoning and reinforcement learning,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  28. [28]

    FedNew: A communication-efficient and privacy-preserving Newton-type method for federated learning,

    A. Elgabli, C. B. Issaid, A. S. Bedi, K. Rajawat, M. Bennis, and V. Aggarwal, “FedNew: A communication-efficient and privacy-preserving Newton-type method for federated learning,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 162, Baltimore, MD, USA: PMLR, Jul. 2022, pp. 5861–5877

  29. [29]

    Communication-efficient distributed optimization using an approximate Newton-type method,

    O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 32, Bejing, China: PMLR, Jun. 2014, pp. 1000–1008

  30. [30]

    FedNL: Making Newton-type methods applicable to federated learning,

    M. Safaryan, R. Islamov, X. Qian, and P. Richtarik, “FedNL: Making Newton-type methods applicable to federated learning,” inInternational Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 162, Baltimore, MD, USA: PMLR, Jul. 2022, pp. 18959–19010

  31. [31]

    Basis matters: Better communication-efficient second order methods for federated learning,

    X. Qian, R. Islamov, M. Safaryan, and P. Richtarik, “Basis matters: Better communication-efficient second order methods for federated learning,” in Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 680–720

  32. [32]

    Distributed second order methods with fast rates and compressed communication,

    R. Islamov, X. Qian, and P. Richtarik, “Distributed second order methods with fast rates and compressed communication,” inProceedings of the 38th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 139, PMLR, Jul. 2021, pp. 4617–4628

  33. [33]

    Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,

    G. Lan, H. Wang, J. Anderson, C. Brinton, and V. Aggarwal, “Improved communication efficiency in federated natural policy gradient via ADMM-based gradient updates,” inThirty-seventh Conference on Neural Information Processing Systems (NeurIPS), vol. 36, New Orleans, LA, USA: Curran Associates, Inc., Dec. 2023, pp. 59873–59885

  34. [34]

    GADMM: Fast and communication efficient framework for distributed machine learning,

    A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, “GADMM: Fast and communication efficient framework for distributed machine learning,” Journal of Machine Learning Research, vol. 21, no. 76, pp. 1–39, 2020

  35. [35]

    Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning,

    A. Elgabli, J. Park, A. S. Bedi, C. B. Issaid, M. Bennis, and V. Aggarwal, “Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning,” IEEE Transactions on Communications, vol. 69, no. 1, pp. 164–181, 2020

  36. [36]

    FedADMM: A federated primal-dual algorithm allowing partial participation,

    H. Wang, S. Marella, and J. Anderson, “FedADMM: A federated primal-dual algorithm allowing partial participation,” inIEEE 61st Conference on Decision and Control (CDC), Cancun, Mexico, 2022

  37. [37]

    Federated deep reinforcement learning,

    H. H. Zhuo, W. Feng, Y. Lin, Q. Xu, and Q. Yang, “Federated deep reinforcement learning,”arXiv preprint arXiv:1901.08277, 2020

  38. [38]

    Federated reinforcement learning: Techniques, applications, and open challenges,

    J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,”arXiv preprint arXiv:2108.11887, 2021

  39. [39]

    Federated reinforcement learning: Linear speedup under Markovian sampling,

    S. Khodadadian, P. Sharma, G. Joshi, and S. T. Maguluri, “Federated reinforcement learning: Linear speedup under Markovian sampling,” inInternational Conference on Machine Learning (ICML), vol. 162, Baltimore, MD, USA: PMLR, 2022, pp. 10997– 11057

  40. [40]

    Federated reinforcement learning with environment heterogeneity,

    H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang, “Federated reinforcement learning with environment heterogeneity,” inInternational Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 18–37

  41. [41]

    Federated TD learning with linear function approximation under environmental heterogeneity,

    H. Wang, A. Mitra, H. Hassani, G. J. Pappas, and J. Anderson, “Federated TD learning with linear function approximation under environmental heterogeneity,”Transactions on Machine Learning Research, 2024,issn: 2835-8856

  42. [42]

    FedKL: Tackling data heterogeneity in federated reinforcement learning by penalizing KL divergence,

    Z. Xie and S. Song, “FedKL: Tackling data heterogeneity in federated reinforcement learning by penalizing KL divergence,”IEEE Journal on Selected Areas in Communi- cations, vol. 41, no. 4, pp. 1227–1242, 2023

  43. [43]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional contin- uous control using generalized advantage estimation,”arXiv preprint arXiv:1506.02438, 2018

  44. [44]

    On the theory of policy gradient methods: Optimality, approximation, and distribution shift,

    A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift,”Journal of Machine Learning Research, vol. 22, no. 1, pp. 4431–4506, 2021

  45. [45]

    Minibatch vs local SGD for heterogeneous distributed learning,

    B. E. Woodworth, K. K. Patel, and N. Srebro, “Minibatch vs local SGD for heterogeneous distributed learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 6281–6292

  46. [46]

    Speedtest market report in the United States

    speedtest.net, Speedtest market report in the United States, 2023. [Online]. Available: http://www.speedtest.net/reports/united-states

  47. [47]

    Distributed optimization and statistical learning via the alternating direction method of multipliers,

    S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011

  48. [48]

    Natural policy gradient primal-dual method for constrained Markov decision processes,

    D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 8378–8390

  49. [49]

    An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods,

    Y. Liu, K. Zhang, T. Basar, and W. Yin, “An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, Curran Associates, Inc., 2020, pp. 7624–7636

  50. [50]

    Stochastic variance-reduced policy gradient,

    M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” in Proceedings of the 35th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80, Stockholm, Sweden: PMLR, Jul. 2018, pp. 4026–4035

  51. [51]

    Sample efficient policy gradient methods with recursive variance reduction,

    P. Xu, F. Gao, and Q. Gu, “Sample efficient policy gradient methods with recursive variance reduction,” inInternational Conference on Learning Representations (ICLR), 2020

  52. [52]

    MuJoCo: A physics engine for model-based control,

    E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” inIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 2012, pp. 5026–5033

  53. [53]

    Stable-Baselines3: Reliable reinforcement learning implementations,

    A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-Baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021

  54. [54]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014

  55. [55]

    PyTorch: An imperative style, high-performance deep learning library,

    A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 32, Vancouver, Canada: Curran Associates, Inc., Dec. 2019

  56. [56]

    Personalized federated reinforcement learning with shared representations,

    G. Xiong, S. Wang, D. Jiang, and J. Li, “Personalized federated reinforcement learning with shared representations,” in Deployable RL: From Research to Practice @ Reinforcement Learning Conference, 2024

  57. [57]

    Asynchronous federated optimization,

    C. Xie, S. Koyejo, and I. Gupta, “Asynchronous federated optimization,” in12th Annual Workshop on Optimization for Machine Learning (OPT), 2020

  58. [58]

    VAFL: A method of vertical asynchronous federated learning,

    T. Chen, X. Jin, Y. Sun, and W. Yin, “VAFL: A method of vertical asynchronous federated learning,”arXiv preprint arXiv:2007.06081, 2020

  59. [59]

    Single-forking of coded subtasks for straggler mitigation,

    A. Badita, P. Parag, and V. Aggarwal, “Single-forking of coded subtasks for straggler mitigation,” IEEE/ACM Transactions on Networking, vol. 29, no. 6, pp. 2413–2424, 2021

  60. [60]

    Asynchronous SGD beats minibatch SGD under arbitrary delays,

    K. Mishchenko, F. Bach, M. Even, and B. E. Woodworth, “Asynchronous SGD beats minibatch SGD under arbitrary delays,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: PMLR, Nov. 2022, pp. 420–433

  61. [61]

    Asynchronous multi-model dynamic federated learning over wireless networks: Theory, modeling, and optimization,

    Z.-L. Chang, S. Hosseinalipour, M. Chiang, and C. G. Brinton, “Asynchronous multi-model dynamic federated learning over wireless networks: Theory, modeling, and optimization,” IEEE Transactions on Cognitive Communications and Networking, vol. 10, no. 5, pp. 1989–2004, 2024

  62. [62]

    Sharper convergence guarantees for asynchronous SGD for distributed and federated learning,

    A. Koloskova, S. U. Stich, and M. Jaggi, “Sharper convergence guarantees for asynchronous SGD for distributed and federated learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, New Orleans, LA, USA: PMLR, Nov. 2022, pp. 17202–17215

  63. [63]

    A general sample complexity analysis of vanilla policy gradient,

    R. Yuan, R. M. Gower, and A. Lazaric, “A general sample complexity analysis of vanilla policy gradient,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 3332–3380

  64. [64]

    Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies,

    I. Fatkhullin, A. Barakat, A. Kireeva, and N. He, “Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies,” inInternational Con- ference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 202, Honolulu, HI, USA: PMLR, Jul. 2023, pp. 9827–9869

  65. [65]

    Momentum-based policy gradient methods,

    F. Huang, S. Gao, J. Pei, and H. Huang, “Momentum-based policy gradient methods,” inInternational conference on machine learning (ICML), ser. Proceedings of Machine Learning Research, vol. 119, PMLR, Jul. 2020, pp. 4422–4433

  66. [66]

    On the global optimum convergence of momentum- based policy gradient,

    Y. Ding, J. Zhang, and J. Lavaei, “On the global optimum convergence of momentum- based policy gradient,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 151, PMLR, Mar. 2022, pp. 1910–1934

  67. [67]

    Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes,

    W. U. Mondal and V. Aggarwal, “Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes,” inInternational Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2024, pp. 3097–3105

  68. [68]

    Efficient and light-weight federated learning via asynchronous distributed dropout,

    C. Dun, M. Hipolito, C. Jermaine, D. Dimitriadis, and A. Kyrillidis, “Efficient and light-weight federated learning via asynchronous distributed dropout,” inProceedings of The 26th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 206, Palau de Congressos, Valencia, Spain: PML...

  69. [69]

    Federated Q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost,

    Z. Zheng, H. Zhang, and L. Xue, “Federated Q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost,”arXiv preprint arXiv:2405.18795, 2024

  70. [70]

    The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond,

    J. Woo, G. Joshi, and Y. Chi, “The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond,” inInternational Conference on Machine Learning (ICML), PMLR, 2023, pp. 37157–37216

  71. [71]

    The sample-communication complexity trade-off in federated Q-learning,

    S. Salgia and Y. Chi, “The sample-communication complexity trade-off in federated Q-learning,”arXiv preprint arXiv:2408.16981, 2024

  72. [72]

    Finite-time analysis of on-policy heterogeneous federated reinforcement learning,

    C. Zhang, H. Wang, A. Mitra, and J. Anderson, “Finite-time analysis of on-policy heterogeneous federated reinforcement learning,” in The Twelfth International Conference on Learning Representations (ICLR), 2024

  73. [73]

    Asynchronous methods for deep reinforcement learning,

    V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” inProceedings of The 33rd International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 48, New York, NY, USA: PMLR, Jun. 2016, pp. 1928– 1937

  74. [74]

    Federated natural policy gradient and actor critic methods for multi-task reinforcement learning,

    T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi, “Federated natural policy gradient and actor critic methods for multi-task reinforcement learning,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

  75. [75]

    Communication-efficient policy gradient methods for distributed reinforcement learning,

    T. Chen, K. Zhang, G. B. Giannakis, and T. Başar, “Communication-efficient policy gradient methods for distributed reinforcement learning,”IEEE Transactions on Control of Network Systems, vol. 9, no. 2, pp. 917–929, 2021

  76. [76]

    Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments,

    H. Wang, S. He, Z. Zhang, F. Miao, and J. Anderson, “Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments,” in International Conference on Machine Learning (ICML), 2024

  77. [77]

    Global convergence guarantees for federated policy gradient methods with adversaries,

    S. Ganesh, J. Chen, G. Thoppe, and V. Aggarwal, “Global convergence guarantees for federated policy gradient methods with adversaries,”Transactions on Machine Learning Research, 2024,issn: 2835-8856

  78. [78]

    Learning to summarize with human feedback,

    N. Stiennon et al., “Learning to summarize with human feedback,”Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 3008–3021, 2020

  79. [79]

    Open problems and fundamental limitations of reinforcement learning from human feedback,

    S. Casper et al., “Open problems and fundamental limitations of reinforcement learning from human feedback,”arXiv preprint arXiv:2307.15217, 2023

  80. [80]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    J. Dai et al., “Safe RLHF: Safe reinforcement learning from human feedback,” arXiv preprint arXiv:2310.12773, 2023

Showing first 80 references.