ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
representative citing papers
SAGC dynamically adjusts group sizes in synchronous GRPO and DAPO via online constrained optimization to cut stragglers, improve wall-clock speed, and maintain or improve rewards and downstream reasoning performance.
SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.
CARL trains a critic for segment-level credit assignment from binary outcomes in LLM tool-use trajectories, yielding 6.7-9.7 point accuracy gains and 53% fewer calls on solvable questions across five benchmarks.
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.
Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
EXPO-SQL improves Text-to-SQL by using clause-level rewards derived from execution error messages and incremental clause execution instead of uniform query-level rewards.
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.
HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
Autocurriculum decomposition for semiautomata simulation achieves 2^O(sqrt(log T)) sample complexity under interactive feedback and relaxes reference model coverage to block length B << T under RLVR, versus Omega(T) for direct methods.
Presents the PyGeoX DSL and a 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows that Saturating Additive Rewards improve the hard-tier solving rate by 2.3x, with an 8B model competitive with larger systems.
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
AgentJet presents a decoupled multi-node swarm architecture for LLM agent RL that enables heterogeneous multi-model training, multi-task isolation, fault tolerance, live code iteration, context-optimized training, and an autonomous research system.
DeltaBox achieves 14 ms checkpoint and 5 ms rollback for AI agent sandboxes via layered DeltaFS and incremental DeltaCR mechanisms that exploit similarity between consecutive states.
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
DualKV eliminates redundant prompt replication in RL training attention kernels via fused dual-KV CUDA operations and token repacking, delivering 1.63-3.82x policy-update speedups while remaining mathematically equivalent to standard attention.
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
citing papers explorer
- HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
- Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
- Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning
A periodically asynchronous on-policy RL system for LLM post-training achieves up to 3x throughput gains by separating inference and training with periodic policy synchronization and a tri-model architecture.
- Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
- RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
- VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
- Supervising the search process produces reliable and generalizable information-seeking agents
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
- Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Learning to Reason at the Frontier of Learnability
A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.
- A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
- Reinforcement Learning from Human Feedback