The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
hub
A survey of reinforcement learning from human feedback.arXiv preprint arXiv:2312.14925
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.
EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
citing papers explorer
-
Incentivizing High-Quality Human Annotations with Golden Questions
The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
-
Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.
-
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
Mitigating Cognitive Bias in RLHF by Altering Rationality
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
-
ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
-
How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators
Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
-
Generating Place-Based Compromises Between Two Points of View
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
-
Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation
RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.
-
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
-
The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
- Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs