A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925

Kaufmann, T · 2023 · arXiv 2312.14925

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Incentivizing High-Quality Human Annotations with Golden Questions

cs.GT · 2025-05-25 · unverdicted · novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

Mitigating Cognitive Bias in RLHF by Altering Rationality

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.

ToolRL: Reward is All Tool Learning Needs

cs.LG · 2025-04-16 · conditional · novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

cs.LG · 2025-02-10 · unverdicted · novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

Generating Place-Based Compromises Between Two Points of View

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

eess.IV · 2025-10-16 · unverdicted · novelty 5.0

RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.

Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

cs.CL · 2026-05-10 · unverdicted · novelty 4.0

RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.

The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence

cs.GL · 2026-04-08 · unverdicted · novelty 2.0

Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

cs.LG · 2025-06-12

citing papers explorer

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · none · ref 29

Mitigating Cognitive Bias in RLHF by Altering Rationality

cs.AI · 2026-05-07 · unverdicted · none · ref 11
