hub Canonical reference

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang · 2024 · cs.AI · arXiv 2405.11143

Canonical reference. 75% of citing Pith papers cite this work as background.

56 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 56 citing papers arXiv PDF

abstract

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 method 2

citation-polarity summary

background 9 use method 2 unclear 1

representative citing papers

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

SAGC dynamically adjusts group sizes in synchronous GRPO and DAPO via online constrained optimization to cut stragglers, improve wall-clock speed, and maintain or improve rewards and downstream reasoning performance.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

CARL trains a critic for segment-level credit assignment from binary outcomes in LLM tool-use trajectories, yielding 6.7-9.7 point accuracy gains and 53% fewer calls on solvable questions across five benchmarks.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

cs.SE · 2026-05-06 · unverdicted · novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

EXPO-SQL improves Text-to-SQL by using clause-level rewards derived from execution error messages and incremental clause execution instead of uniform query-level rewards.

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

cs.AI · 2026-02-05 · unverdicted · novelty 7.0

RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

cs.AI · 2026-01-23 · unverdicted · novelty 7.0

SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

cs.DC · 2025-12-13 · unverdicted · novelty 7.0

HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

cs.LG · 2025-07-21 · unverdicted · novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

Learning to Reason with Curriculum II: Compositional Generalization

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Autocurriculum decomposition for semiautomata simulation achieves 2^O(sqrt(log T)) sample complexity under interactive feedback and relaxes reference model coverage to block length B << T under RLVR, versus Omega(T) for direct methods.

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Presents PyGeoX DSL and 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows Saturating Additive Rewards improve hard-tier solving rate by 2.3x with an 8B model competitive to larger systems.

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

cs.LG · 2026-06-03 · conditional · novelty 6.0

Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

cs.OS · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

DeltaBox achieves 14 ms checkpoint and 5 ms rollback for AI agent sandboxes via layered DeltaFS and incremental DeltaCR mechanisms that exploit similarity between consecutive states.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

cs.LG · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

PriorZero: Bridging Language Priors and World Models for Decision Making

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.

citing papers explorer

Showing 11 of 11 citing papers after filters.

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training cs.AI · 2026-02-05 · unverdicted · none · ref 8 · internal anchor
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning cs.AI · 2026-01-23 · unverdicted · none · ref 3 · internal anchor
SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning cs.AI · 2026-06-03 · unverdicted · none · ref 5 · internal anchor
AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models cs.AI · 2026-05-13 · unverdicted · none · ref 12 · 2 links · internal anchor
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 10 · 2 links · internal anchor
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism cs.AI · 2026-05-03 · unverdicted · none · ref 6 · internal anchor
PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.
Mitigating LLM biases toward spurious social contexts using direct preference optimization cs.AI · 2026-04-02 · unverdicted · none · ref 12 · internal anchor
Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning cs.AI · 2025-03-18 · conditional · none · ref 21 · internal anchor
Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 271 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence cs.AI · 2026-05-07 · unverdicted · none · ref 63 · 2 links · internal anchor
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer