A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925

Kaufmann, T · 2025 · arXiv 2312.14925

18 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Incentivizing High-Quality Human Annotations with Golden Questions

cs.GT · 2025-05-25 · unverdicted · novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

PortraitGen integrates real-image exemplars into GRPO sampling and applies dual rewards (OmniReward and AI-Portrait) to improve photorealism, claiming better results than baselines on a new PortraitBench.

APPO: Agentic Procedural Policy Optimization

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving over baselines by nearly 4 points across 13 benchmarks.

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

Mitigating Cognitive Bias in RLHF by Altering Rationality

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.

ToolRL: Reward is All Tool Learning Needs

cs.LG · 2025-04-16 · conditional · novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

cs.LG · 2025-02-10 · unverdicted · novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

Staleness-Learning Rate Scaling Laws for Asynchronous RLHF

cs.LG · 2026-07-01 · unverdicted · novelty 5.0 · 2 refs

Stale rollouts introduce O(S * eta) surrogate-gradient bias in async GRPO, yielding stability condition eta << min{R_batch / (S * G_upd), R_crit / (T * G_upd)} under smoothness assumptions.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

cs.LG · 2026-06-20 · unverdicted · novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

Generating Place-Based Compromises Between Two Points of View

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

eess.IV · 2025-10-16 · unverdicted · novelty 5.0

RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.

Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

cs.CL · 2026-05-10 · unverdicted · novelty 4.0

RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.

AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks

cs.CY · 2026-06-10 · unverdicted · novelty 2.0

AI researchers must lead technical research in arms control to mitigate risks from military AI systems, drawing lessons from nuclear deterrence.

The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence

cs.GL · 2026-04-08 · unverdicted · novelty 2.0

Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

cs.LG · 2025-06-12

citing papers explorer

AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks

cs.CY · 2026-06-10 · unverdicted · none · ref 47

AI researchers must lead technical research in arms control to mitigate risks from military AI systems, drawing lessons from nuclear deterrence.
