pith. machine review for the scientific record.

Fine-Tuning Language Models from Human Preferences

80 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
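
The core recipe in the abstract — fit a reward model to comparisons collected from human labelers, then optimize the language model against that reward with RL — can be sketched compactly. The snippet below is an illustrative toy, not the paper's implementation (the paper asks labelers to choose among four samples and fine-tunes the policy with PPO plus a KL penalty): it trains a scalar reward head on pairwise comparisons using a Bradley-Terry-style log-sigmoid loss over score differences, with random tensors standing in for pooled language-model features. All names and dimensions here are assumptions.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled text representation to a scalar reward (illustrative)."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_features).squeeze(-1)

def preference_loss(reward_preferred: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the labeler-preferred continuation
    # receives the higher reward (pairwise Bradley-Terry objective).
    return -torch.nn.functional.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy training step: random features stand in for pooled LM activations
# of the preferred and rejected continuations in each human comparison.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

preferred = torch.randn(8, 768)   # continuations the labelers chose
rejected = torch.randn(8, 768)    # continuations the labelers rejected

optimizer.zero_grad()
loss = preference_loss(model(preferred), model(rejected))
loss.backward()
optimizer.step()

In the paper's full pipeline, the learned reward (regularized by a KL penalty toward the pretrained model) then serves as the RL objective for fine-tuning the policy; the sketch above covers only the reward-modeling half.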

representative citing papers

Efficient Preference Poisoning Attack on Offline RLHF

cs.LG · 2026-05-04 · unverdicted · novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Convex Optimization with Nested Evolving Feasible Sets

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, shown optimal by a matching lower bound.

Interactive Episodic Memory with User Feedback

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

KTO: Model Alignment as Prospect Theoretic Optimization

cs.LG · 2024-02-02 · conditional · novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Let's Verify Step by Step

cs.LG · 2023-05-31 · accept · novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset when combined with active learning; the work also releases an 800k step-level label dataset.

Red Teaming Language Models with Language Models

cs.CL · 2022-02-07 · conditional · novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

Driving Intents Amplify Planning-Oriented Reinforcement Learning

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
