arxiv: 2409.19256 · v2 · submitted 2024-09-28 · 💻 cs.LG · cs.DC

Recognition: 2 theorem links

HybridFlow: A Flexible and Efficient RLHF Framework

Chi Zhang, Chuan Wu, Guangming Sheng, Haibin Lin, Ru Zhang, Wang Zhang, Xibin Wu, Yanghua Peng, Zilingfeng Ye

Pith reviewed 2026-05-11 07:47 UTC · model grok-4.3

classification 💻 cs.LG cs.DC

keywords RLHFHybridFlowLLM alignmentdataflow orchestrationsingle-controllermulti-controllermodel reshardingthroughput optimization

0 comments

The pith

HybridFlow combines single-controller and multi-controller paradigms to flexibly represent and efficiently execute RLHF dataflows for large language model alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses inefficiencies in Reinforcement Learning from Human Feedback when applied to large language models. Single-controller systems incur high overhead dispatching commands for distributed computations inside nodes, while multi-controller systems struggle to flexibly handle the many-to-many data dependencies across nodes. HybridFlow introduces a hybrid control model together with hierarchical APIs that separate and encapsulate computation from communication, plus a specialized engine for moving model parameters between training and generation phases. This design allows arbitrary RLHF algorithms to be expressed and run on varied hardware. Experiments report throughput gains ranging from 1.53 times to 20.57 times over prior systems.

Core claim

HybridFlow combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. Hierarchical APIs decouple and encapsulate computation and data dependencies, supporting efficient operation orchestration and flexible device mapping. The 3D-HybridEngine performs actor model resharding between training and generation with zero memory redundancy and reduced communication overhead.

What carries the argument

HybridFlow's hybrid single/multi-controller execution model together with its hierarchical APIs for decoupling computation from data dependencies and the 3D-HybridEngine for zero-redundancy model resharding.

If this is right

RLHF algorithms gain both flexible representation of complex data dependencies and reduced control-dispatch overhead during distributed execution.
Actor models can transition between training and generation phases without memory duplication or high communication costs.
The same framework supports multiple RLHF variants by swapping only the orchestration logic expressed through the hierarchical APIs.
Device mapping of computation nodes becomes independent of the dataflow representation, enabling use across different cluster sizes and hardware types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hybrid control pattern may apply to other machine-learning pipelines that alternate between distributed training and inference phases.
Developers could reuse the hierarchical APIs to prototype new RLHF variants without rewriting low-level communication code.
If the resharding engine generalizes beyond the actor model, similar zero-redundancy techniques could reduce memory pressure in other multi-stage LLM workflows.

Load-bearing premise

The hierarchical APIs and 3D-HybridEngine can be implemented with negligible overhead while supporting arbitrary RLHF algorithms and hardware without introducing new bottlenecks or correctness issues in the dataflow orchestration.

What would settle it

Running the same set of RLHF algorithms on identical hardware setups with HybridFlow yields no measurable throughput improvement or produces incorrect model outputs or training divergence.

read the original abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53$\times$~20.57$\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes HybridFlow, a hybrid RLHF framework that merges single-controller and multi-controller paradigms via hierarchical APIs to flexibly represent and efficiently execute complex RLHF dataflows (with nodes as distributed NN computations and edges as many-to-many multicasts). It introduces a 3D-HybridEngine for zero-redundancy actor-model resharding between training and generation phases. Experiments report 1.53×–20.57× throughput gains over state-of-the-art baselines for various RLHF algorithms, with code to be open-sourced.

Significance. If the throughput claims and generality hold, HybridFlow would meaningfully advance practical RLHF systems by addressing control overhead and inflexibility in distributed settings, offering a reusable abstraction layer that could accelerate development of new alignment algorithms while improving hardware utilization. The planned code release supports reproducibility.

major comments (2)

[Abstract] Abstract and experimental results: the central throughput claim (1.53×–20.57×) is presented without any description of baselines, hardware specifications, workload details, or ablation studies isolating the hybrid control and 3D-HybridEngine contributions from other optimizations; this makes it impossible to verify whether gains stem from the proposed paradigm or implementation specifics.
[Section 3] Design of hierarchical APIs and 3D-HybridEngine (Section 3): the manuscript asserts negligible orchestration overhead and correctness for arbitrary RLHF dataflows (including non-standard algorithms, heterogeneous hardware, and complex multicast patterns), yet provides neither quantitative overhead measurements, machine-checked invariants, nor edge-case coverage beyond the reported cases; this leaves the weakest assumption untested and risks the gains being non-generalizable.

minor comments (1)

[Abstract] The notation '1.53×~20.57×' in the abstract is ambiguous; replace with '1.53× to 20.57×' or specify the exact range and conditions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications from the full paper and indicating revisions where appropriate to improve verifiability and completeness.

read point-by-point responses

Referee: [Abstract] Abstract and experimental results: the central throughput claim (1.53×–20.57×) is presented without any description of baselines, hardware specifications, workload details, or ablation studies isolating the hybrid control and 3D-HybridEngine contributions from other optimizations; this makes it impossible to verify whether gains stem from the proposed paradigm or implementation specifics.

Authors: We agree the abstract is concise and omits these details. The full manuscript's Section 4 (Experiments) specifies the baselines (DeepSpeed-Chat, vLLM+DeepSpeed, and Hugging Face TRL), hardware (NVIDIA A100 GPU clusters with 8-32 GPUs), workloads (PPO, DPO, GRPO on Llama-7B/13B models with standard datasets), and reports throughput under identical conditions. To isolate contributions, we have added new ablation studies in the revised Section 4.3 that separately measure the hybrid control paradigm and 3D-HybridEngine effects. We will also append a brief evaluation summary to the abstract. revision: yes
Referee: [Section 3] Design of hierarchical APIs and 3D-HybridEngine (Section 3): the manuscript asserts negligible orchestration overhead and correctness for arbitrary RLHF dataflows (including non-standard algorithms, heterogeneous hardware, and complex multicast patterns), yet provides neither quantitative overhead measurements, machine-checked invariants, nor edge-case coverage beyond the reported cases; this leaves the weakest assumption untested and risks the gains being non-generalizable.

Authors: The manuscript reports orchestration overhead measurements in Section 4.2 (under 5% of runtime for tested cases) and validates correctness empirically across multiple RLHF algorithms with multicast patterns. We will expand Section 3 with additional quantitative overhead data for heterogeneous hardware and more complex non-standard dataflows, plus a new subsection on edge-case coverage and limitations. Machine-checked invariants are outside the scope of this systems paper, which relies on empirical validation and open-source code for reproducibility rather than formal methods. revision: partial

standing simulated objections not resolved

Providing machine-checked invariants or formal proofs of correctness for arbitrary RLHF dataflows under the hierarchical APIs and 3D-HybridEngine.

Circularity Check

0 steps flagged

No circularity: claims rest on architecture design and independent runtime measurements

full rationale

The paper describes a hybrid RLHF framework with hierarchical APIs and 3D-HybridEngine for dataflow orchestration. No mathematical derivation chain, fitted parameters, or predictions exist; throughput gains (1.53×–20.57×) are reported from direct experiments on various algorithms rather than any self-referential equations or self-citation load-bearing uniqueness theorems. The central claims about flexibility and efficiency are supported by system implementation details and empirical benchmarks that do not reduce to the inputs by construction. This is a standard systems paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper is an engineering systems contribution; the central claim does not rest on fitted parameters, unproven mathematical axioms, or new physical entities. It relies on standard assumptions about distributed GPU clusters and LLM training workloads.

axioms (1)

domain assumption Standard assumptions about distributed computing environments and neural network training dynamics hold for the target hardware and workloads.
The framework design and performance claims presuppose typical properties of GPU clusters and LLM training without stating exceptions.

invented entities (1)

3D-HybridEngine no independent evidence
purpose: Efficient actor model resharding between training and generation phases with zero memory redundancy
New software component introduced to solve the resharding problem; no independent evidence outside the paper's experiments is provided.

pith-pipeline@v0.9.0 · 5607 in / 1402 out tokens · 84856 ms · 2026-05-11T07:47:58.870354+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 7.0

An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 7.0

ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States
cs.CL 2026-05 unverdicted novelty 7.0

SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
cs.LG 2026-04 unverdicted novelty 7.0

SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
Learning Vision-Language-Action World Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 7.0

VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
DeonticBench: A Benchmark for Reasoning over Rules
cs.CL 2026-04 unverdicted novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
cs.LG 2026-04 unverdicted novelty 7.0

TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
Think Anywhere in Code Generation
cs.SE 2026-03 unverdicted novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
cs.LG 2026-05 unverdicted novelty 6.0

FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.
Internalizing Safety Understanding in Large Reasoning Models via Verification
cs.AI 2026-05 unverdicted novelty 6.0

Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG 2026-05 unverdicted novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 6.0

ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
cs.LG 2026-05 unverdicted novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
cs.DC 2026-05 unverdicted novelty 6.0

ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
cs.CL 2026-05 unverdicted novelty 6.0

A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...
RVPO: Risk-Sensitive Alignment via Variance Regularization
cs.LG 2026-05 unverdicted novelty 6.0

RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.
Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
cs.CL 2026-05 unverdicted novelty 6.0

FineStep adds step-level process rewards and credit assignment to tool-augmented Text-to-SQL, achieving 3.25% higher execution accuracy than GRPO on BIRD while cutting redundant tool calls.
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
When Less is Enough: Efficient Inference via Collaborative Reasoning
cs.LG 2026-05 conditional novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems
cs.RO 2026-04 unverdicted novelty 6.0

Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
cs.CL 2026-04 unverdicted novelty 6.0

Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
MemReader: From Passive to Active Extraction for Long-Term Agent Memory
cs.CL 2026-04 unverdicted novelty 6.0

MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
Target Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning
cs.CV 2026-04 unverdicted novelty 6.0

SciTikZer-8B uses a new dataset, benchmark, and dual self-consistency RL to generate TikZ code for scientific graphics, outperforming much larger models like Gemini-2.5-Pro.
COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration
cs.AI 2026-04 unverdicted novelty 6.0

COSMO-Agent trains LLMs via tool-augmented RL and a multi-constraint reward to close the CAD-CAE loop, with experiments showing small open-source models outperforming larger ones on feasibility and stability for 25 co...
TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning
cs.SE 2026-04 unverdicted novelty 6.0

By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL 2025-06 unverdicted novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
cs.CL 2025-06 conditional novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
cs.LG 2026-05 unverdicted novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
cs.LG 2026-05 unverdicted novelty 5.0

Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
cs.CV 2026-05 unverdicted novelty 5.0

LensVLM trains VLMs to scan compressed rendered text images and selectively expand task-relevant regions, achieving 4.3x compression with near full-text accuracy and outperforming baselines up to 10.1x on text QA benchmarks.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
An End-to-End Framework for Building Large Language Models for Software Operations
cs.LG 2026-04 unverdicted novelty 5.0

OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 61 Pith papers · 13 internal anchors

[1]

Martín Abadi. 2016. TensorFlow: learning functions at scale. InProceed- ings of the 21st ACM SIGPLAN international conference on functional programming. 1–1

work page 2016
[2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al . 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhar- gav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023)

work page arXiv 2023
[4]

Riad Akrour, Marc Schoenauer, and Michele Sebag. 2011. Preference- based policy learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011. Proceedings, Part I 11 . Springer, 12–27

work page 2011
[5]

Gene M Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. InProceedings of the April 18-20, 1967, spring joint computer conference . 483–485

work page 1967
[6]

George E Andrews and Kimmo Eriksson. 2004. Integer partitions . Cambridge University Press

work page 2004
[7]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al . 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. {PipeSwitch}: Fast pipelined context switching for deep learning applications. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 499–514

work page 2020
[9]

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al . 2022. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems 4 (2022), 430–449

work page 2022
[10]

Eric Temple Bell. 1934. Exponential polynomials. Annals of Mathe- matics (1934), 258–277

work page 1934
[11]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[12]

I. Caspi. 2017. Reinforcement learning coach by Intel . https://github. com/NervanaSystems/coach

work page 2017
[13]

Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert Van De Geijn. 2007. Collective communication: theory, practice, and expe- rience. Concurrency and Computation: Practice and Experience 19, 13 (2007), 1749–1783

work page 2007
[14]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Collosal-AI Corporation. 2023. Collosal-Chat. https://github.com/ binmakeswell/ColossalChat

work page 2023
[16]

NVIDIA Corporation. 2023. TensorRT-LLM: A TensorRT Toolbox for Op- timized Large Language Model Inference. https://github.com/NVIDIA/ TensorRT-LLM

work page 2023
[17]

NVIDIA Corporation. 2024. NeMo-Aligner: Scalable toolkit for efficient model alignment. https://github.com/NVIDIA/NeMo-Aligner

work page 2024
[18]

Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. 2022. {DVABatch}: Diversity-aware {Multi- Entry} {Multi-Exit} batching for efficient processing of {DNN} ser- vices on {GPUs }. In2022 USENIX Annual Technical Conference (USENIX ATC 22). 183–198

work page 2022
[19]

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2024. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In The Twelfth International Confer- ence on Learning Representations . https://openreview.net/forum?id= TyFrPOKYXw

work page 2024
[20]

Frederica Darema. 2001. The spmd model: Past, present and future. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 8th European PVM/MPI Users’ Group Meeting Santorini/Thera, Greece, September 23–26, 2001 Proceedings 8 . Springer, 1–1

work page 2001
[21]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113

work page 2008
[22]

Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. {Check-N-Run}: A check- pointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 22). 929–943

work page 2022
[23]

Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al . 2021. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming . 431–445

work page 2021
[24]

X Yu Geoffrey, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko

work page
[25]

In2021 USENIX Annual Technical Conference (USENIX ATC 21)

Habitat: A {Runtime-Based} computational performance predic- tor for deep neural network training. In2021 USENIX Annual Technical Conference (USENIX ATC 21). 503–521

work page
[26]

Danijar Hafner, James Davidson, and Vincent Vanhoucke. 2017. Ten- sorflow agents: Efficient batched reinforcement learning in tensorflow. arXiv preprint arXiv:1709.02878 (2017)

work page arXiv 2017
[27]

Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 539–558

work page 2022
[28]

Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, and Louis Castricato

work page
[29]

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

trlX: A framework for large scale reinforcement learning from human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing . 8578–8595

work page 2023
[30]

Hesse, M

C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu

work page
[31]

https://github.com/openai/baselines

OpenAI baselines. https://github.com/openai/baselines

work page
[32]

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. DeepSpeed- FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv preprint arXiv:2401.08671 (2024)

work page arXiv 2024
[33]

Jian Hu, Xibin Wu, Xianyu, Chen Su, Leon Qiu, Daoning Jiang, Qing Wang, and Weixun Wang. 2023. OpenRLHF: A Ray-based High- performance RLHF framework. https://github.com/OpenLLMAI/ OpenRLHF

work page 2023
[34]

Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. 2024. The N+ Implementation Details of RLHF with PPO: A Case Study on TL; DR Summarization. arXiv preprint arXiv:2403.17031 (2024)

work page arXiv 2024
[35]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019). EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands G. Sheng, C. ...

work page 2019
[36]

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequen- tial building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European conference on computer systems 2007 . 59–72

work page 2007
[37]

Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowd- hury. 2023. Oobleck: Resilient distributed training of large models using pipeline templates. In Proceedings of the 29th Symposium on Operating Systems Principles. 382–395

work page 2023
[38]

Sylvain Jeaugey. 2017. Nccl 2.0. In GPU Technology Conference (GTC), Vol. 2. 23

work page 2017
[39]

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al

work page
[40]

arXiv preprint arXiv:2402.15627 (2024)

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. arXiv preprint arXiv:2402.15627 (2024)

work page arXiv 2024
[41]

Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier

work page
[42]

arXiv preprint arXiv:2312.14925

A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925 (2023)

work page arXiv 2023
[43]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Sto- chastic Optimization. arXiv:1412.6980 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Kostrikov

I. Kostrikov. 2017. PyTorch implementation of advantage actor critic (A2C), proximal policy optimization (PPO) and scalable trust-region method for deep reinforcement learning. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr

work page 2017
[45]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

work page
[46]

InProceedings of the 29th Symposium on Operating Systems Principles

Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626

work page
[47]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267 (2023)

work page arXiv 2023
[48]

Cheng Li. 2023. LLM-Analysis: Latency and Memory Analysis of Trans- former Models for Training and Inference. https://github.com/cli99/llm- analysis

work page 2023
[49]

Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 2023. ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models. arXiv preprint arXiv: 2310.10505 (2023)

work page arXiv 2023
[50]

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gon- zalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) . 663–679

work page 2023
[51]

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for distributed reinforcement learning. In Interna- tional conference on machine learning . PMLR, 3053–3062

work page 2018
[52]

Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph E Gonzalez, and Ion Stoica. 2021. RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem. Advances in Neural Information Processing Systems 34 (2021), 5506–5517

work page 2021
[53]

Yun Liang, Huynh Phung Huynh, Kyle Rupnow, Rick Siow Mong Goh, and Deming Chen. 2014. Efficient GPU spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems 26, 3 (2014), 748– 760

work page 2014
[54]

Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xi- aowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) . IEEE, 553–564

work page 2017
[55]

Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. {CheckFreq}: Frequent, {Fine-Grained} {DNN} Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21) . 203–216

work page 2021
[56]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerg- ing {AI} applications. In 13th USENIX symposium on operating systems design and implementation (OSDI 18) . 561–577

work page 2018
[57]

Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 439–455

work page 2013
[58]

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al . 2021. Webgpt: Browser-assisted question- answering with human feedback. arXiv preprint arXiv:2112.09332 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[59]

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM symposium on operating sys- tems principles. 1–15

work page 2019
[60]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. In Proceedings of the International Conference for High Performance Computing, Netw...

work page 2021
[61]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wain- wright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al . 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744

work page 2022
[62]

Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dy- namic resource management for efficient utilization of multitasking GPUs. In Proceedings of the twenty-second international conference on architectural support for programming languages and operating systems . 527–540

work page 2017
[63]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al . 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural informa- tion processing systems 32 (2019)

work page 2019
[64]

Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A Generic Commu- nication Scheduler for Distributed DNN Training Acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Prin- ciples. ACM, Huntsville Ontario Canada, 16–29. https://doi.org/10. 1145/3341301.3359642

work page arXiv 2019
[65]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

work page
[66]

In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

Zero: Memory optimizations toward training trillion param- eter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis . IEEE, 1–16

work page
[67]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He

work page
[68]

In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506

work page
[69]

Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He

work page
[70]

In 2021 USENIX Annual Technical Conference (USENIX ATC 21)

{Zero-offload}: Democratizing {billion-scale} model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21) . 551–564

work page 2021
[71]

Gian-Carlo Rota. 1964. The number of partitions of a set.The American Mathematical Monthly 71, 5 (1964), 498–504

work page 1964
[72]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal HybridFlow: A Flexible and Efficient RLHF Framework EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bit- ton, Manish Bhatt, Cristian Canton Ferre...

work page 2025
[73]

Code Llama: Open Foundation Models for Code

Code Llama: Open Foundation Models for Code. arXiv preprint arXiv: 2308.12950 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[74]

Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen

work page
[75]

Efficient rlhf: Reducing the memory usage of ppo

Efficient RLHF: Reducing the Memory Usage of PPO. arXiv preprint arXiv: 2309.00754 (2023)

work page arXiv 2023
[76]

Hill Kohli Saxton, Grefenstette. 2019. Analysing Mathematical Rea- soning Abilities of Neural Models. arXiv:1904.01557 (2019)

work page arXiv 2019
[77]

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. InInternational conference on machine learning . PMLR, 1889–1897

work page 2015
[78]

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2018. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2018
[79]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[80]

Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)

work page arXiv 2018

Showing first 80 references.