The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
hub
A survey of reinforcement learning from human feedback.arXiv preprint arXiv:2312.14925
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
PortraitGen integrates real-image exemplars into GRPO sampling and applies dual rewards (OmniReward and AI-Portrait) to improve photorealism, claiming better results than baselines on a new PortraitBench.
APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving baselines by nearly 4 points on 13 benchmarks.
Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.
EvalVerse is a pipeline-aware benchmark that distills expert cinematic judgments into VLMs to assess 'goodness' metrics like aesthetics and multi-shot coherence alongside basic prompt adherence.
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
Stale rollouts introduce O(S * eta) surrogate-gradient bias in async GRPO, yielding stability condition eta << min{R_batch / (S * G_upd), R_crit / (T * G_upd)} under smoothness assumptions.
Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.
Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
RL4Seg3D applies reinforcement learning with novel reward functions and fusion to adapt echocardiography segmentation models across domains, improving accuracy, anatomical validity, and temporal consistency on over 30,000 videos without target labels.
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
AI researchers must lead technical research in arms control to mitigate risks from military AI systems, drawing lessons from nuclear deterrence.
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
citing papers explorer
-
AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks
AI researchers must lead technical research in arms control to mitigate risks from military AI systems, drawing lessons from nuclear deterrence.