OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Pith reviewed 2026-05-15 03:24 UTC · model grok-4.3
The pith
OpenRLHF delivers a streamlined open-source framework for RLHF that reportedly trains models 1.22x to 1.68x faster than state-of-the-art alternatives while requiring far fewer lines of code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenRLHF is an easy-to-use RLHF framework that integrates Ray for distributed execution, vLLM for efficient inference, DeepSpeed for optimization, and HuggingFace Transformers for model management. Its simplified architecture removes common inference bottlenecks and coding complexity, yielding measured training speedups of 1.22x to 1.68x compared with state-of-the-art alternatives and substantially fewer lines of code for equivalent implementations.
What carries the argument
The combined Ray-vLLM-DeepSpeed-HuggingFace stack that handles distributed RLHF training, fast inference, and model optimization in one integrated pipeline.
If this is right
- RLHF experiments on large models become feasible for smaller teams without heavy custom engineering.
- Faster iteration cycles allow more rapid testing of alignment techniques on reasoning and long-context tasks.
- Widespread adoption reduces duplicated effort in building RLHF pipelines across research groups.
- The same infrastructure supports both human-feedback and verifiable-reward training loops.
Where Pith is reading between the lines
- Lower entry costs may increase the number of independent groups able to run large-scale alignment experiments.
- Efficiency gains could shift compute budgets toward larger batch sizes or longer training runs rather than infrastructure overhead.
- If the code simplicity generalizes, similar layered designs might appear in other LLM training domains such as continued pretraining.
Load-bearing premise
The reported speedups and code reductions were measured against truly comparable, fully optimized versions of existing frameworks on equivalent hardware.
What would settle it
A side-by-side run on the same hardware and task showing whether a fully tuned existing framework matches or exceeds OpenRLHF's training speed, which would erase the reported 1.22x-1.68x gap, while using the same or fewer lines of code for setup.
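The side-by-side comparison reduces to a ratio of wall-clock times. The following is a minimal Python sketch, assuming both frameworks are timed on the same hardware and task. The function names `speedup` and `claim_survives` and all timing values are hypothetical placeholders, not measurements from the paper.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("wall-clock times must be positive")
    return baseline_seconds / candidate_seconds


def claim_survives(baseline_seconds: float, openrlhf_seconds: float,
                   claimed_min: float = 1.22) -> bool:
    """True if the measured speedup reaches the lower end of the claimed range."""
    return speedup(baseline_seconds, openrlhf_seconds) >= claimed_min


# Hypothetical timings (seconds per training iteration), for illustration only.
baseline_time = 1500.0
openrlhf_time = 1000.0
ratio = speedup(baseline_time, openrlhf_time)
print(f"speedup: {ratio:.2f}x, claim survives: "
      f"{claim_survives(baseline_time, openrlhf_time)}")
```

A measured ratio below 1.22 on a tuned baseline would count against the claim; a ratio inside or above the reported range would support it.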
Original abstract
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenRLHF, an open-source RLHF/RLVR framework built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers. It claims a simplified design that reduces inference bottlenecks and complexity barriers, with experimental results demonstrating 1.22x–1.68x training speedups over state-of-the-art frameworks across model sizes and significantly fewer lines of code for implementation. The framework is released at https://github.com/OpenRLHF/OpenRLHF and reported to have been adopted by leading institutions.
Significance. If the efficiency and usability claims hold under reproducible conditions, OpenRLHF could meaningfully lower the barrier for RLHF research, particularly for reasoning and long-context tasks, by providing a more accessible alternative to existing frameworks. The open-source release with documentation is a concrete strength that supports broader adoption.
major comments (3)
- [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim of 1.22x–1.68x speedups is presented without any description of the experimental setup, including model sizes tested, hardware (e.g., GPU count and type), baseline framework versions and configurations, batch sizes, or whether equivalent vLLM/DeepSpeed/Ray optimizations were applied to the comparison frameworks. This leaves open the possibility that reported gains reflect differences in tuning rather than architectural advantages.
- [Abstract / Experimental Results] Abstract and Experimental Results section: no information is provided on the number of runs, variance, or statistical significance of the speedup numbers, nor on profiling data showing where bottlenecks shifted (e.g., inference vs. communication). Without these, the efficiency superiority claim cannot be evaluated as load-bearing evidence.
- [Abstract] Abstract: the 'significantly fewer lines of code' ease-of-use metric is stated without defining the scope of code counted (core RLHF loop vs. full pipeline including data handling and evaluation), the exact baseline implementations compared, or any accounting for user extensions and debugging effort required in practice.
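The statistical reporting requested in the second major comment could take a form like the following. This is a minimal Python sketch, assuming repeated runs are paired by index across the two frameworks; `summarize_speedups` and all timing values are hypothetical, not data from the paper.

```python
from statistics import mean, stdev


def summarize_speedups(baseline_runs, candidate_runs):
    """Pair runs by index and return (mean, sample std dev) of per-run speedups."""
    if len(baseline_runs) != len(candidate_runs) or len(baseline_runs) < 2:
        raise ValueError("need at least two paired runs")
    ratios = [b / c for b, c in zip(baseline_runs, candidate_runs)]
    return mean(ratios), stdev(ratios)


# Hypothetical wall-clock times in seconds, for illustration only.
baseline_runs = [120.0, 130.0, 110.0]
candidate_runs = [100.0, 100.0, 100.0]
avg, spread = summarize_speedups(baseline_runs, candidate_runs)
print(f"speedup: {avg:.2f}x +/- {spread:.2f} over {len(baseline_runs)} runs")
```

Reporting the spread alongside the mean lets a reader judge whether a speedup near the low end of the range is distinguishable from run-to-run noise.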
minor comments (1)
- [Abstract] The abstract mentions adoption by 'leading institutions' without naming them or providing references; this claim would benefit from either removal or concrete citations if they exist in the full text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and rigor of our claims. We address each major comment below and will revise the manuscript to incorporate additional details and clarifications.
Point-by-point responses
- Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim of 1.22x–1.68x speedups is presented without any description of the experimental setup, including model sizes tested, hardware (e.g., GPU count and type), baseline framework versions and configurations, batch sizes, or whether equivalent vLLM/DeepSpeed/Ray optimizations were applied to the comparison frameworks. This leaves open the possibility that reported gains reflect differences in tuning rather than architectural advantages.
Authors: We agree that a complete description of the experimental setup is necessary to substantiate the speedup claims and enable fair evaluation. In the revised manuscript, we will expand the Experimental Results section with explicit details on model sizes (7B, 13B, and 70B), hardware (8x NVIDIA A100-80GB GPUs), baseline framework versions and configurations, batch sizes, and confirmation that equivalent vLLM, DeepSpeed, and Ray optimizations were enabled in the comparison frameworks. The reported speedups stem from our architecture's reduction of inference bottlenecks via integrated Ray-vLLM scheduling rather than from differential tuning. revision: yes
- Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: no information is provided on the number of runs, variance, or statistical significance of the speedup numbers, nor on profiling data showing where bottlenecks shifted (e.g., inference vs. communication). Without these, the efficiency superiority claim cannot be evaluated as load-bearing evidence.
Authors: We acknowledge the value of statistical reporting. The speedup numbers represent averages across multiple runs; we will add this information (typically 3–5 runs per configuration) along with standard deviations in the revised Experimental Results section. Available profiling data will also be included to show that the primary gains come from reduced inference latency, with communication overheads remaining comparable across frameworks. Full statistical significance tests will be reported where the data permit. revision: partial
- Referee: [Abstract] Abstract: the 'significantly fewer lines of code' ease-of-use metric is stated without defining the scope of code counted (core RLHF loop vs. full pipeline including data handling and evaluation), the exact baseline implementations compared, or any accounting for user extensions and debugging effort required in practice.
Authors: The lines-of-code comparison is limited to the core RLHF/RLVR training loop (actor-critic update, reward modeling, and PPO/GRPO steps), excluding data loading, evaluation, and user-facing scripts. We compared against the primary implementation modules in TRL and DeepSpeed-Chat. In the revision we will explicitly define this scope, provide the exact baseline repositories and line counts, and note that while our framework lowers the initial implementation barrier, real-world extensions may still require user-specific debugging effort. revision: yes
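A lines-of-code comparison only means something once the counting rule is fixed. The following is a minimal Python sketch of one possible rule, counting lines that are neither blank nor pure `#` comments. The function `count_code_lines` and the sample snippet are hypothetical, and this is not the counting method used in the paper.

```python
def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


# Hypothetical snippet standing in for a training-loop module.
sample = """\
# clipping helper (illustrative only)
import math

def clip(x, lo, hi):
    return max(lo, min(x, hi))
"""
print(count_code_lines(sample))  # 3
```

This rule still counts docstrings and multi-line string literals as code, so any published comparison would need to state how those are handled and which files fall inside the counted scope.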
Circularity Check
No circularity; empirical speedups rest on external framework comparisons
Full rationale
The paper introduces an open-source RLHF framework (OpenRLHF) built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers. Its central claims are experimental: measured wall-clock speedups of 1.22x–1.68x versus other frameworks and reduced lines of code. These rest on direct benchmarking against external systems rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the manuscript contains no mathematical derivation chain that could reduce to its own inputs by construction. Self-citations (if present) are incidental and not load-bearing for the efficiency results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ray, vLLM, DeepSpeed, and HuggingFace Transformers operate as documented and can be integrated without unexpected performance loss.
Forward citations
Cited by 23 Pith papers
- AIS: Adaptive Importance Sampling for Quantized RL
  AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
- Variance-aware Reward Modeling with Anchor Guidance
  Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
  BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
  Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
- Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
  Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
- Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
  Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
- MinT: Managed Infrastructure for Training and Serving Millions of LLMs
  MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
- D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
  D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.
- PriorZero: Bridging Language Priors and World Models for Decision Making
  PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
  CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
- ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
  ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
- JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
  JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
- Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
  Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
- TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
  TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
- Mitigating LLM biases toward spurious social contexts using direct preference optimization
  Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
- PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers
  PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
  Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
  Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.