EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
hub Canonical reference
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
hub tools
citation-role summary
citation-polarity summary
roles
background 7representative citing papers
Conditional lower bounds prove that exact local-query auditing of near-optimal policies requires Ω(2^{Hopt^F(ε)}) queries when occupancy Rashomon capacity is realized by a sparse-signature packing.
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
Preference poisoning against log-linear DPO reduces to a binary sparse approximation problem solved by lattice-reduction (BAL-A) and matching-pursuit (BMP-A) algorithms that carry recovery guarantees.
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.
Empirical study of Malaysian preference judgments finds that 79% of prompts have multiple majority-supported responses discarded by single-winner aggregation, indicating measurement of argmax rather than plural alignment.
RLHF provides shallow alignment by inactivating partisan features and severing causal pathways in LLMs without erasing partisan geometry, as evidenced by sparse autoencoder analysis and steering experiments.
Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.
Open-weight LLMs display domain-dependent compliance with harmful requests spanning 71 percentage points, including a technical framing bypass that overrides safety without detectable signals, replicated in shape on closed frontier models.
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Characterizes spurious correlation mechanisms in preference optimization via mean spurious bias and causal-spurious correlation leakage, demonstrates irreducible vulnerability to distribution shift, and introduces tie training as selective mitigation with validation on log-linear models and empirica
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
Recursive generative retraining with heterogeneous rewards converges to a stable distribution satisfying a weighted Nash bargaining solution, preserving diversity under stated conditions.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
citing papers explorer
-
Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics
An autonomy-qualified Second Welfare Theorem is stated for post-AGI economies under the joint conditions of convexity, stable moral status, non-fungible rights, welfare selection, non-manipulation, governed self-modification, and verification.
-
Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning
ULPS integrates A*-generated symbolic trajectories, fine-tuned BERT priors, MC dropout uncertainty, and entropy-based blending into PPO, reporting over 9% accuracy gains and better sample efficiency on the MiniGridUnlockPickup benchmark.
-
ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks
ClaHF converts instance labels into preference signals via candidate predictions and a reward model, then applies RL optimization to improve text classification accuracy and calibration.
-
PrefPaint: Enhancing Medical Image Inpainting through Expert Human Feedback
PrefPaint uses D3PO and a Model Tree web interface to incorporate gastroenterologist feedback into Stable Diffusion inpainting, producing anatomically accurate polyp images that outperform prior methods in user studies.
-
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
-
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
-
Cultivating Machine Intelligence: The OMEGA Shift from Top-Down Optimization to Autopoietic Cognitive Ecologies
The paper introduces the RECLAIM framework and OMEGA shift as a transition from top-down optimization to autopoietic cognitive ecologies for cultivating machine intelligence.
-
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
-
AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks
AI researchers must lead technical research in arms control to mitigate risks from military AI systems, drawing lessons from nuclear deterrence.
-
The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
-
Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.