pith. machine review for the scientific record. sign in

arxiv: 2405.11143 · v6 · submitted 2024-05-20 · 💻 cs.AI · cs.CL· cs.LG

Recognition: no theorem link

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu , Xibin Wu , Wei Shen , Jason Klein Liu , Zilin Zhu , Weixun Wang , Songlin Jiang , Haoran Wang , Hao Chen , Bin Chen , Weikai Fang , Xianyu , Yu Cao , Haotian Xu , Yiming Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords OpenRLHFRLHFLLM alignmentreinforcement learningtraining frameworkscalabilityopen sourceefficiency
0
0 comments X

The pith

OpenRLHF delivers a streamlined open-source framework for RLHF that trains models 1.22x to 1.68x faster while requiring far fewer lines of code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenRLHF as an accessible framework for fine-tuning large language models through Reinforcement Learning from Human Feedback and Reinforcement Learning with Verifiable Rewards. Built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers, it simplifies the pipeline with clear code structure and documentation aimed at newcomers. Tests across model sizes show consistent speedups over prior frameworks, plus reduced coding effort for setup and execution. This targets the practical barriers in aligning AI systems with human values on complex reasoning and long-context tasks. The design prioritizes scalability without sacrificing performance for typical research workloads.

Core claim

OpenRLHF is an easy-to-use RLHF framework that integrates Ray for distributed execution, vLLM for efficient inference, DeepSpeed for optimization, and HuggingFace Transformers for model management. Its simplified architecture removes common inference bottlenecks and coding complexity, yielding measured training speedups of 1.22x to 1.68x compared with state-of-the-art alternatives and substantially lower lines of code for equivalent implementations.

What carries the argument

The combined Ray-vLLM-DeepSpeed-HuggingFace stack that handles distributed RLHF training, fast inference, and model optimization in one integrated pipeline.

If this is right

  • RLHF experiments on large models become feasible for smaller teams without heavy custom engineering.
  • Faster iteration cycles allow more rapid testing of alignment techniques on reasoning and long-context tasks.
  • Widespread adoption reduces duplicated effort in building RLHF pipelines across research groups.
  • The same infrastructure supports both human-feedback and verifiable-reward training loops.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lower entry costs may increase the number of independent groups able to run large-scale alignment experiments.
  • Efficiency gains could shift compute budgets toward larger batch sizes or longer training runs rather than infrastructure overhead.
  • If the code simplicity generalizes, similar layered designs might appear in other LLM training domains such as continued pretraining.

Load-bearing premise

The reported speedups and code reductions were measured against truly comparable, fully optimized versions of existing frameworks on equivalent hardware.

What would settle it

A side-by-side run on the same hardware and task showing that an existing framework matches or exceeds the 1.22x-1.68x speedup while using the same or fewer lines of code for setup.

read the original abstract

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces OpenRLHF, an open-source RLHF/RLVR framework built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers. It claims a simplified design that reduces inference bottlenecks and complexity barriers, with experimental results demonstrating 1.22x–1.68x training speedups over state-of-the-art frameworks across model sizes and significantly fewer lines of code for implementation. The framework is released at https://github.com/OpenRLHF/OpenRLHF and reported to have been adopted by leading institutions.

Significance. If the efficiency and usability claims hold under reproducible conditions, OpenRLHF could meaningfully lower the barrier for RLHF research, particularly for reasoning and long-context tasks, by providing a more accessible alternative to existing frameworks. The open-source release with documentation is a concrete strength that supports broader adoption.

major comments (3)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim of 1.22x–1.68x speedups is presented without any description of the experimental setup, including model sizes tested, hardware (e.g., GPU count and type), baseline framework versions and configurations, batch sizes, or whether equivalent vLLM/DeepSpeed/Ray optimizations were applied to the comparison frameworks. This leaves open the possibility that reported gains reflect differences in tuning rather than architectural advantages.
  2. [Abstract / Experimental Results] Abstract and Experimental Results section: no information is provided on the number of runs, variance, or statistical significance of the speedup numbers, nor on profiling data showing where bottlenecks shifted (e.g., inference vs. communication). Without these, the efficiency superiority claim cannot be evaluated as load-bearing evidence.
  3. [Abstract] Abstract: the 'significantly fewer lines of code' ease-of-use metric is stated without defining the scope of code counted (core RLHF loop vs. full pipeline including data handling and evaluation), the exact baseline implementations compared, or any accounting for user extensions and debugging effort required in practice.
minor comments (1)
  1. [Abstract] The abstract mentions adoption by 'leading institutions' without naming them or providing references; this claim would benefit from either removal or concrete citations if they exist in the full text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and rigor of our claims. We address each major comment below and will revise the manuscript to incorporate additional details and clarifications.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the central claim of 1.22x–1.68x speedups is presented without any description of the experimental setup, including model sizes tested, hardware (e.g., GPU count and type), baseline framework versions and configurations, batch sizes, or whether equivalent vLLM/DeepSpeed/Ray optimizations were applied to the comparison frameworks. This leaves open the possibility that reported gains reflect differences in tuning rather than architectural advantages.

    Authors: We agree that a complete description of the experimental setup is necessary to substantiate the speedup claims and enable fair evaluation. In the revised manuscript, we will expand the Experimental Results section with explicit details on model sizes (7B, 13B, and 70B), hardware (8x NVIDIA A100-80GB GPUs), baseline framework versions and configurations, batch sizes, and confirmation that equivalent vLLM, DeepSpeed, and Ray optimizations were enabled in the comparison frameworks. The reported speedups stem from our architecture's reduction of inference bottlenecks via integrated Ray-vLLM scheduling rather than from differential tuning. revision: yes

  2. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: no information is provided on the number of runs, variance, or statistical significance of the speedup numbers, nor on profiling data showing where bottlenecks shifted (e.g., inference vs. communication). Without these, the efficiency superiority claim cannot be evaluated as load-bearing evidence.

    Authors: We acknowledge the value of statistical reporting. The speedup numbers represent averages across multiple runs; we will add this information (typically 3–5 runs per configuration) along with standard deviations in the revised Experimental Results section. Available profiling data will also be included to show that the primary gains come from reduced inference latency, with communication overheads remaining comparable across frameworks. Full statistical significance tests will be reported where the data permit. revision: partial

  3. Referee: [Abstract] Abstract: the 'significantly fewer lines of code' ease-of-use metric is stated without defining the scope of code counted (core RLHF loop vs. full pipeline including data handling and evaluation), the exact baseline implementations compared, or any accounting for user extensions and debugging effort required in practice.

    Authors: The lines-of-code comparison is limited to the core RLHF/RLVR training loop (actor-critic update, reward modeling, and PPO/GRPO steps), excluding data loading, evaluation, and user-facing scripts. We compared against the primary implementation modules in TRL and DeepSpeed-Chat. In the revision we will explicitly define this scope, provide the exact baseline repositories and line counts, and note that while our framework lowers the initial implementation barrier, real-world extensions may still require user-specific debugging effort. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical speedups rest on external framework comparisons

full rationale

The paper introduces an open-source RLHF framework (OpenRLHF) built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers. Its central claims are experimental: measured wall-clock speedups of 1.22x–1.68x versus other frameworks and reduced lines of code. These rest on direct benchmarking against external systems rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the manuscript contains no mathematical derivation chain that could reduce to its own inputs by construction. Self-citations (if present) are incidental and not load-bearing for the efficiency results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the correct functioning of four external libraries and standard assumptions about distributed training infrastructure; no free parameters, new mathematical entities, or ad-hoc axioms are introduced.

axioms (1)
  • domain assumption Ray, vLLM, DeepSpeed, and HuggingFace Transformers operate as documented and can be integrated without unexpected performance loss.
    The entire system is constructed on top of these libraries.

pith-pipeline@v0.9.0 · 5565 in / 1166 out tokens · 46218 ms · 2026-05-15T03:24:18.968024+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  2. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  3. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  4. Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

    cs.SE 2026-05 unverdicted novelty 7.0

    Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

  5. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  6. Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.

  7. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

  8. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  9. PriorZero: Bridging Language Priors and World Models for Decision Making

    cs.LG 2026-05 unverdicted novelty 6.0

    PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.

  10. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  11. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  12. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  13. ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

    cs.DC 2026-05 unverdicted novelty 6.0

    ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.

  14. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  15. Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

    cs.CL 2026-04 unverdicted novelty 6.0

    Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

  16. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  17. Mitigating LLM biases toward spurious social contexts using direct preference optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.

  18. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  19. PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

    cs.DC 2026-05 unverdicted novelty 5.0

    PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

  20. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  21. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  22. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  23. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 21 Pith papers · 10 internal anchors

  1. [1]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  2. [2]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  4. [4]

    Exploring data scaling trends and effects in reinforcement learning from human feedback

    Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, and Lin Yan. Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv preprint arXiv:2503.22230, 2025

  5. [5]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  7. [7]

    Deepspeed-chat: Easy, fast and affordable rlhf training of chatgpt-like models at all scales

    Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, et al. Deepspeed-chat: Easy, fast and affordable rlhf training of chatgpt-like models at all scales. arXiv preprint arXiv:2308.01320, 2023

  8. [8]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. Trl: Transformer reinforcement learning. https://github. com/huggingface/trl, 2020

  9. [9]

    Colossal-ai: A unified deep learning system for large-scale parallel training

    Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, and Yang You. Colossal-ai: A unified deep learning system for large-scale parallel training. InProceedings of the 52nd International Conference on Parallel Processing, pages 766–775, 2023

  10. [10]

    Nemo-aligner: Scalable toolkit for efficient model alignment, 2024

    Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, Markel Sanz Ausin, Ashwath Aithal, and Oleksii Kuchaiev. Nemo-aligner: Scalable toolkit for efficient model alignment, 2024

  11. [11]

    chatlearn.https://github.com/alibaba/ChatLearn, 2017

    alibaba. chatlearn.https://github.com/alibaba/ChatLearn, 2017

  12. [12]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  13. [13]

    Rllib: Abstractions for distributed reinforcement learning

    Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. Rllib: Abstractions for distributed reinforcement learning. InInternational conference on machine learning, pages 3053–3062. PMLR, 2018

  14. [14]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the- ar...

  15. [15]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  16. [16]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 8

  17. [17]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  18. [18]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  19. [19]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  20. [20]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  21. [21]

    Zero: Memory optimiza- tions toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  22. [22]

    Openai baselines

    Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017

  23. [23]

    Ad- vanced tricks for training large language models with proximal policy optimiza- tion

    Wei Shen, Jian Hu, Pengyu Zhao, Xiaonan He, and Lichang Chen. Ad- vanced tricks for training large language models with proximal policy optimiza- tion. https://swtheking.notion.site/eb7b2d1891f44b3a84e7396d19d39e6f?v= 01bcb084210149488d730064cbabc99f&pvs=74, 2024. Notion Blog

  24. [24]

    Hugging face

    Shashank Mohan Jain. Hugging face. InIntroduction to transformers for NLP: With the hugging face library and models to solve problems, pages 51–67. Springer, 2022

  25. [25]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

  26. [26]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  27. [27]

    Pytorch.Programming with TensorFlow: solution for edge computing applications, pages 87–104, 2021

    Sagar Imambi, Kolla Bhanu Prakash, and GR Kanagachidambaresan. Pytorch.Programming with TensorFlow: solution for edge computing applications, pages 87–104, 2021

  28. [28]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023

  29. [29]

    Rethinking kl regularization in rlhf: From value estimation to gradient optimization.arXiv preprint arXiv:2510.01555, 2025

    Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking kl regularization in rlhf: From value estimation to gradient optimization.arXiv preprint arXiv:2510.01555, 2025

  30. [30]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023. 9 A Full Contributors A more complete list can be found in the OpenRLHF commit and release history. Ray Integr...