GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
super hub Canonical reference
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and
authors
co-cited works
representative citing papers
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text identification, and a single algorithm that tolerates any finite corruption budget.
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.
BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.
LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.
An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
TRI-TSMC is a trust-region framework for learning twisting functions in SMC-based inference-time alignment of diffusion models that yields zero-variance samplers in theory and better alignment on text and image tasks under fixed budgets.
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
First convergence analysis of DPO under federated and decentralized training, characterizing rates via client drift, communication frequency, preference heterogeneity, and graph spectral connectivity.
GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
citing papers explorer
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
-
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
-
Gemini: A Family of Highly Capable Multimodal Models
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse rewards on capable teachers for exploration followed by dense distillation to students outperforms direct sparse reward application like GRPO on the deployment model.
-
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
An SCM-GRPO framework grounds multi-hop reasoning in structural dependency graphs and optimizes chain length via rule-based RL, outperforming baselines on HoVer and EX-FEVER.
-
Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator
The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.