Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, Paul F Christiano · 2020

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

browse 7 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

Alignment Dynamics in LLM Fine-Tuning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

ANO: A Principled Approach to Robust Policy Optimization

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

Reinforcement Learning with Action Chunking

cs.LG · 2025-07-10 · unverdicted · novelty 6.0

Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.

citing papers explorer

Showing 7 of 7 citing papers.

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 56
MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models cs.CV · 2026-05-07 · unverdicted · none · ref 33
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
Alignment Dynamics in LLM Fine-Tuning cs.LG · 2026-05-18 · unverdicted · none · ref 4
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization cs.AI · 2026-05-18 · unverdicted · none · ref 16
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
ANO: A Principled Approach to Robust Policy Optimization cs.AI · 2026-05-04 · unverdicted · none · ref 29
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.
Reinforcement Learning with Action Chunking cs.LG · 2025-07-10 · unverdicted · none · ref 71
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation cs.LG · 2026-05-21 · unverdicted · none · ref 26
A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.

Learning to summarize with human feedback

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer