Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.
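Sync-GRPO and Dynamic Group Scaling are this paper's own constructs, so the sketch below only illustrates the group-relative advantage idea from the GRPO family that the names point to. The `dynamic_group_size` heuristic is a purely hypothetical stand-in for Dynamic Group Scaling; all names and the variance-based sizing rule are assumptions, not the paper's method.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style group-relative advantages: normalize each sampled
    response's reward by the mean/std of its own group, so no learned
    value function (critic) is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

def dynamic_group_size(prev_rewards: np.ndarray,
                       base: int = 8, lo: int = 4, hi: int = 32) -> int:
    """Hypothetical stand-in for Dynamic Group Scaling: sample more
    responses per prompt when rewards are noisy, fewer when stable."""
    noise = prev_rewards.std() / (abs(prev_rewards.mean()) + 1e-6)
    return int(np.clip(base * (1 + noise), lo, hi))

rewards = np.array([0.1, 0.9, 0.4, 0.7])  # rewards for 4 sampled responses
print(grpo_advantages(rewards))            # zero-mean, roughly unit-scale
print(dynamic_group_size(rewards))         # next group's sample count
```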
Citing papers
- Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
  LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
- MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service
  MARLaaS enables concurrent RL fine-tuning across up to 32 tasks using LoRA adapters and a disaggregated asynchronous architecture, matching single-task performance while improving accelerator utilization by 4.3x and cutting end-to-end time by 85%.
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
  REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
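The last entry's claim is concrete enough to sketch: drop PPO's clipping, learned value network, and GAE, and baseline plain REINFORCE with the mean reward of the other samples for the same prompt (the leave-one-out baseline used by RLOO, which this line of work reports). The sketch below is illustrative only, not the paper's code; the reward and log-probability values are made-up placeholders.

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE Leave-One-Out: each of the k sampled completions is
    baselined by the mean reward of the other k-1 samples for the same
    prompt, removing PPO's learned value network entirely."""
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)  # leave-one-out mean
    return rewards - baseline

def reinforce_loss(logprobs: np.ndarray, advantages: np.ndarray) -> float:
    """Plain REINFORCE surrogate, -E[A * log pi(y|x)]: no clipping,
    no GAE, no per-token value estimates."""
    return float(-(advantages * logprobs).mean())

rewards = np.array([1.0, 0.2, 0.5, 0.8])           # reward-model scores, k=4
logprobs = np.array([-12.3, -15.1, -13.7, -11.9])  # sequence log-probs
adv = rloo_advantages(rewards)
print(adv, reinforce_loss(logprobs, adv))
```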