pith. machine review for the scientific record.


Fine-Tuning Language Models from Human Preferences

81 Pith papers cite this work.
abstract

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
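The reward model described above is fit to pairwise human comparisons: the labeler picks the better of two continuations, and the model is trained so the preferred sample scores higher. A minimal sketch of the standard Bradley-Terry-style objective (function name and toy scores are illustrative, not from the paper):

```python
import numpy as np

def pairwise_preference_loss(r_preferred, r_rejected):
    """Bradley-Terry-style loss: -log sigmoid(r_preferred - r_rejected).

    r_preferred / r_rejected are scalar reward-model scores for the
    human-preferred and rejected continuations of the same prompt.
    log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x)).
    """
    return float(np.log1p(np.exp(-(r_preferred - r_rejected))))

# Toy comparison data: (score of preferred sample, score of rejected sample).
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.5)]
losses = [pairwise_preference_loss(a, b) for a, b in pairs]
# The loss shrinks as the preferred sample's score pulls ahead of the
# rejected one, and equals log(2) when the two scores tie.
```

In the full pipeline the scalar scores come from a learned head on the pretrained language model, and the fitted reward then serves as the RL objective; this sketch only shows the comparison loss itself.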




representative citing papers

Efficient Preference Poisoning Attack on Offline RLHF

cs.LG · 2026-05-04 · unverdicted · novelty 8.0

Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Convex Optimization with Nested Evolving Feasible Sets

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

For convex losses over nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret against O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, shown optimal by a matching lower bound.

Interactive Episodic Memory with User Feedback

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

KTO: Model Alignment as Prospect Theoretic Optimization

cs.LG · 2024-02-02 · conditional · novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Let's Verify Step by Step

cs.LG · 2023-05-31 · accept · novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

Red Teaming Language Models with Language Models

cs.CL · 2022-02-07 · conditional · novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

citing papers explorer

Showing 7 of 7 citing papers.

  • Interactive Episodic Memory with User Feedback cs.CV · 2026-04-27 · unverdicted · ref 47

    Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

  • Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · ref 56

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

  • Implicit Preference Alignment for Human Image Animation cs.CV · 2026-05-08 · unverdicted · ref 12

    IPA aligns animation models for superior hand quality via implicit reward maximization on self-generated samples plus hand-focused local optimization, avoiding expensive paired data.

  • VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis cs.CV · 2026-04-08 · unverdicted · ref 58

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  • Visual-RFT: Visual Reinforcement Fine-Tuning cs.CV · 2025-03-03 · conditional · ref 53

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  • SHIFT: Steering Hidden Intermediates in Flow Transformers cs.CV · 2026-04-10 · unverdicted · ref 36

    SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.

  • Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · ref 181

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.