Transactions on Machine Learning Research

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback · 2023

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

MOPD improves on-policy distillation by using peer successes and failures from multiple rollouts to construct more informative teacher signals, yielding consistent gains over baselines on reasoning benchmarks.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

cs.AI · 2025-07-15 · unverdicted · novelty 5.0

Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

citing papers explorer

Showing 4 of 4 citing papers.

Variance-aware Reward Modeling with Anchor Guidance · stat.ML · 2026-05-12 · unverdicted · none · ref 53
Multi-Rollout On-Policy Distillation via Peer Successes and Failures · cs.LG · 2026-05-12 · unverdicted · none · ref 48
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025-07-15 · unverdicted · none · ref 98
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility · cs.LG · 2026-05-07 · unverdicted · none · ref 85 · 2 links
