Recognition: no theorem link
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Pith reviewed 2026-05-12 01:43 UTC · model grok-4.3
The pith
Reinforcement learning provides a unifying framework for both scalable optimization in distributed settings and trustworthy behavior aligned with human preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges: scaling efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents, and ensuring that optimized policies align with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. Together, these contributions make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization, and more trustworthy through improved preference alignment and reduced contextually inappropriate information disclosure.
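To make the scalability half of this claim concrete, here is a minimal sketch of one family of communication-efficient federated policy-gradient methods: each agent uploads a quantized local gradient and a central server averages the payloads. This is an illustration only, not the dissertation's algorithm; the function names (quantize, local_policy_gradient), the one-dimensional Gaussian policy, and the toy reward are all invented for the example.

```python
# Sketch: one FedAvg-style policy-gradient round with 8-bit gradient
# quantization. Illustrative assumption, not the dissertation's method.
import numpy as np

rng = np.random.default_rng(0)

def quantize(g, bits=8):
    """Uniform scalar quantization: the low-bandwidth payload an agent sends."""
    scale = np.max(np.abs(g)) + 1e-12
    levels = 2 ** (bits - 1) - 1
    q = np.round(g / scale * levels).astype(np.int8)  # int8 codes, not floats
    return q, scale

def dequantize(q, scale, bits=8):
    levels = 2 ** (bits - 1) - 1
    return q.astype(np.float64) * scale / levels

def local_policy_gradient(theta, n_samples=64):
    """REINFORCE gradient for a 1-D Gaussian policy on a toy quadratic reward."""
    actions = theta + rng.normal(size=n_samples)   # a ~ N(theta, 1)
    rewards = -(actions - 3.0) ** 2                # reward peaks at a = 3
    # Score-function estimator with a mean-reward baseline:
    # grad log N(a; theta, 1) = (a - theta).
    return np.mean((rewards - rewards.mean()) * (actions - theta))

theta = np.zeros(1)                                # shared policy parameter
for round_ in range(200):
    payloads = [quantize(np.atleast_1d(local_policy_gradient(theta)))
                for _ in range(8)]                 # 8 agents, 8-bit uplink each
    avg_grad = np.mean([dequantize(q, s) for q, s in payloads], axis=0)
    theta += 0.05 * avg_grad                       # server-side ascent step
print(theta)  # drifts toward the reward-maximizing action, ~3.0
```

The bandwidth saving is the point of the construction: each agent ships int8 codes plus a single scale factor per round instead of full-precision gradients.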
What carries the argument
Four complementary contributions spanning federated optimization, preference alignment, and contextual safety, which together advance reinforcement learning along two dimensions: communication-efficient scalability and trustworthy, human-aligned behavior.
If this is right
- Reinforcement learning policies can be optimized efficiently in federated settings with limited communication bandwidth and heterogeneous computation across agents.
- Optimized policies for large language models can be aligned with human preferences (see the preference-loss sketch after this list).
- Intelligent systems can satisfy safety requirements such as privacy-aware information disclosure in language-based agents.
- Reinforcement learning supplies a single framework that addresses both efficient optimization and trustworthy behavior in next-generation intelligent systems.
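The preference-alignment sketch promised above: a minimal Direct Preference Optimization (DPO) loss, the objective of Rafailov et al. (2023) that aligns a policy with pairwise human preferences without training an explicit reward model. Treating DPO as representative of the dissertation's alignment machinery is an assumption, and the inputs below are toy sequence log-probabilities.

```python
# Sketch of the DPO objective (Rafailov et al., 2023). Whether the
# dissertation uses exactly this loss is an assumption. Inputs are
# per-example sequence log-probabilities under the policy being tuned
# and under a frozen reference model.
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a batch of (chosen, rejected) preference pairs."""
    # Implicit reward margin: how much the policy moved toward the chosen
    # answer relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log sigmoid(margin), via the identity -log(sigmoid(m)) = log(1 + e^{-m}).
    return np.mean(np.log1p(np.exp(-margin)))

# Toy batch: the policy already slightly prefers the chosen responses.
logp_c = np.array([-10.0, -12.0]); ref_c = np.array([-11.0, -12.5])
logp_r = np.array([-13.0, -11.0]); ref_r = np.array([-12.5, -11.0])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r))  # < log 2, i.e., better than chance
```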
Where Pith is reading between the lines
- The approach could extend to real-time decision systems in robotics or autonomous vehicles, where similar federated and alignment methods might deliver both low-latency operation and adherence to ethical constraints.
- If the methods prove robust, they might inform regulatory standards for deploying reinforcement learning in safety-critical distributed networks.
- A natural next test would be whether these techniques combine with existing post-training methods like RLHF to produce measurable gains in multi-agent scenarios.
- The work leaves open whether the same unification holds when scaling to very large models or highly dynamic environments beyond the dissertation's focus.
Load-bearing premise
That the four as-yet-unspecified contributions in federated optimization, preference alignment, and contextual safety actually deliver measurable improvements in scalability and safety.
What would settle it
Empirical results showing that the proposed federated algorithms do not reduce communication costs, or that the alignment techniques fail to reduce unsafe disclosures in language-model outputs, would falsify the central claim.
Original abstract
Reinforcement learning has become a powerful paradigm for improving the capability of intelligent systems, but its practical deployment faces two central challenges. First, reinforcement learning must scale efficiently in distributed environments where communication bandwidth is limited and computation is heterogeneous across agents. Second, as reinforcement learning is increasingly used in post-training large language models and autonomous agents, the optimized policies must also be aligned with human preferences and satisfy safety requirements such as privacy-aware information disclosure. This dissertation addresses both challenges through four complementary contributions spanning federated optimization, preference alignment, and contextual safety. The first part of the dissertation studies scalable reinforcement learning in federated settings. The second part of the dissertation studies trustworthy reinforcement learning for large language models. Together, these contributions advance reinforcement learning along two complementary dimensions. On the one hand, they make reinforcement learning more scalable through communication-efficient and asynchronous federated optimization. On the other hand, they make reinforcement learning more trustworthy by improving alignment with human preferences and by reducing contextually inappropriate information disclosure in language-based intelligent systems. As a whole, this dissertation argues that the next generation of intelligent systems will require both efficient optimization and trustworthy behavior, and that reinforcement learning provides a unifying framework for addressing both goals.
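The abstract names asynchronous federated optimization as one of its scalability mechanisms. A minimal sketch of the generic idea follows: the server applies each agent's gradient as it arrives, down-weighted by how many model versions elapsed while the agent was computing, instead of waiting for stragglers. The 1/(1 + staleness) weighting is a common heuristic from the asynchronous federated-learning literature (cf. Xie et al.), not necessarily the dissertation's rule, and the objective is a toy quadratic.

```python
# Sketch of staleness-weighted asynchronous aggregation; an assumption
# about the general mechanism, not the dissertation's algorithm.
import heapq
import numpy as np

rng = np.random.default_rng(1)
theta, version = 0.0, 0

def grad(theta_snapshot):
    # Noisy gradient of a toy objective f(theta) = (theta - 2)^2 / 2.
    return (theta_snapshot - 2.0) + rng.normal(scale=0.1)

# Event queue: (arrival_time, model_version_the_agent_pulled, gradient).
events = []
for agent in range(4):
    heapq.heappush(events, (rng.exponential(), version, grad(theta)))

for _ in range(400):
    t, v_used, g = heapq.heappop(events)
    staleness = version - v_used              # versions elapsed in transit
    theta -= 0.3 / (1.0 + staleness) * g      # stale updates count for less
    version += 1
    # The freed agent pulls the new model and starts another local round.
    heapq.heappush(events, (t + rng.exponential(), version, grad(theta)))

print(theta)  # settles near the optimum, theta* = 2.0
```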
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This dissertation argues that reinforcement learning provides a unifying framework for next-generation intelligent systems by addressing two challenges: efficient scaling in distributed federated environments with limited communication and heterogeneous computation, and trustworthiness in post-training of large language models and autonomous agents via human preference alignment and safety constraints such as privacy-aware disclosure. It structures the work into two parts—scalable federated RL through communication-efficient and asynchronous optimization, and trustworthy RL for LLMs—with four complementary contributions spanning federated optimization, preference alignment, and contextual safety.
Significance. If the four contributions deliver measurable gains in scalability and safety, the work could help unify efficiency and alignment research in RL applications to modern AI. The high-level framing is conceptually coherent with ongoing trends in federated learning and LLM post-training, but the manuscript offers no methods, experiments, quantitative results, or error analysis to substantiate the claims, leaving the significance prospective rather than established.
Major comments (1)
- [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We agree that the abstract requires revision to better substantiate its claims with references to the specific methods and results from the four contributions.
Point-by-point responses
Referee: [Abstract] Abstract and overall structure: the manuscript asserts that the four unspecified contributions in federated optimization, preference alignment, and contextual safety 'advance reinforcement learning along two complementary dimensions' and realize a unifying framework, yet provides no algorithms, derivations, experimental protocols, datasets, metrics, or results to support these assertions. This absence is load-bearing for the central claim, as the thesis reduces to an unverified assertion that applying RL to both problem classes advances both goals.
Authors: We acknowledge the referee's observation that the submitted abstract summarizes the contributions at a high level without including the supporting technical details. The full dissertation contains four specific contributions, each with algorithms (including communication-efficient and asynchronous federated RL methods), derivations, experimental protocols, datasets, metrics, and quantitative results demonstrating scalability gains and improved alignment/safety. To address this directly, we will revise the abstract to name the four contributions explicitly and briefly reference their key methods and empirical outcomes, thereby grounding the unifying framework claim. We will also update the overall structure description to cross-reference the detailed chapters.
Revision: yes
Circularity Check
No circularity: high-level thesis with no derivations or equations
Full rationale
The manuscript is a dissertation abstract and high-level position statement asserting that reinforcement learning supplies a unifying framework for scalable federated optimization and trustworthy preference alignment/safety. No equations, first-principles derivations, fitted parameters, or quantitative predictions appear in the provided text. The central claim is an organizational assertion about complementary contributions rather than a mathematical reduction; it does not define any quantity in terms of itself, rename a known result, or rely on self-citation chains for load-bearing steps. The argument is therefore self-contained at the level of scope description and does not exhibit any of the enumerated circularity patterns.