Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
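The reduction is easiest to see as a small procedure. Below is a minimal sketch, assuming a log-linear (linear-feature) DPO model in which each preference pair contributes a feature-difference vector and flipping its label negates that contribution; the greedy loop is a generic binary matching-pursuit heuristic, and every name in it (greedy_label_flips, pair_diffs, target) is an illustrative stand-in, not the paper's algorithm or interface.

```python
import numpy as np

def greedy_label_flips(pair_diffs, target, budget):
    """Hedged sketch: label-flip poisoning as binary matching pursuit.

    pair_diffs: (n, d) array; row i is phi(x_i, y_w) - phi(x_i, y_l),
        the feature difference of preference pair i under a log-linear
        DPO model.
    target: (d,) direction the attacker wants the aggregate preference
        signal to move toward.
    budget: maximum number of labels the attacker may flip.
    """
    n, _ = pair_diffs.shape
    signs = np.ones(n)                 # +1 = original label, -1 = flipped
    residual = np.asarray(target, dtype=float) - pair_diffs.sum(axis=0)
    flipped = []
    for _ in range(budget):
        # Flipping pair i swaps (y_w, y_l), negating its contribution,
        # which changes the residual by +2 * signs[i] * pair_diffs[i].
        deltas = 2.0 * signs[:, None] * pair_diffs        # (n, d)
        norms = np.linalg.norm(residual[None, :] + deltas, axis=1)
        norms[flipped] = np.inf                           # one flip per pair
        i = int(np.argmin(norms))
        if norms[i] >= np.linalg.norm(residual):
            break                                         # no improving flip
        residual += deltas[i]
        signs[i] = -signs[i]
        flipped.append(i)
    return flipped
```

Recovery guarantees of the kind the headline claims would hinge on conditions on the pair-difference matrix (e.g. incoherence); the loop above only makes the binary-sparse-approximation framing concrete.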
citation dossier
arXiv preprint arXiv:2307.15217, 2023.
why this work matters in Pith
Pith has found this work cited in 19 reviewed papers. Its strongest current cluster is cs.LG (9 papers), and the largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution (the weighted Nash objective is sketched after this list).
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models (a Lagrangian formulation is sketched after this list).
Alignment of vision-language models with human V1-V3 early visual cortex predicts resistance to sycophantic gaslighting attacks.
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
Clinician overrides of AI recommendations provide implicit preference signals for training clinical AI, harnessed via a new framework with a five-category override taxonomy, preferences conditioned on patient state and clinician capability, and dual reward-capability learning to prevent suppression bias.
The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though realignment reduces utility, revealing an asymmetry between attack and defense methods.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
AIRA is a 15-check audit framework that finds AI-generated code has 1.8 times more high-severity failure and untruthfulness patterns than human-written code in a matched replication study.
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI (the Rao-Blackwell identity is sketched below).
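For the generative-retraining entry above, the weighted Nash bargaining solution has a standard form worth stating: the limiting distribution maximizes a weight-exponentiated product of group utilities. The notation (utilities U_i, weights w_i, disagreement points d_i) is assumed here for illustration; the paper's exact objective may differ.

```latex
% Weighted Nash bargaining over candidate distributions p:
p^{\star} \in \arg\max_{p} \prod_{i} \bigl(U_i(p) - d_i\bigr)^{w_i}
\;\Longleftrightarrow\;
p^{\star} \in \arg\max_{p} \sum_{i} w_i \log\bigl(U_i(p) - d_i\bigr).
```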
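For the safe-RLHF entry, the generic primal-dual scheme for a discounted CMDP alternates policy-gradient ascent on a Lagrangian with projected descent on the multiplier. Notation (reward return J_r, cost return J_c, budget b, step sizes eta) is assumed; this is the textbook template rather than the paper's specific update.

```latex
L(\theta, \lambda) = J_r(\theta) - \lambda \bigl(J_c(\theta) - b\bigr), \qquad
\theta_{t+1} = \theta_t + \eta_{\theta} \nabla_{\theta} L(\theta_t, \lambda_t), \qquad
\lambda_{t+1} = \bigl[\lambda_t + \eta_{\lambda} \bigl(J_c(\theta_t) - b\bigr)\bigr]_{+}.
```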
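And for the Blackwell entry, the Rao-Blackwell identity underlying the variance-reduction claim: conditioning any estimator on a sufficient statistic never increases variance, by the law of total variance. Textbook form, stated only to make the claim concrete.

```latex
\tilde{\theta} = \mathbb{E}\bigl[\hat{\theta} \mid T\bigr], \qquad
\operatorname{Var}(\hat{\theta})
  = \mathbb{E}\bigl[\operatorname{Var}(\hat{\theta} \mid T)\bigr]
  + \operatorname{Var}\bigl(\mathbb{E}[\hat{\theta} \mid T]\bigr)
  \;\ge\; \operatorname{Var}(\tilde{\theta}).
```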
citing papers explorer
-
Efficient Preference Poisoning Attack on Offline RLHF
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex predicts resistance to sycophantic gaslighting attacks.
-
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
-
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
-
Can Revealed Preferences Clarify LLM Alignment and Steering?
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
-
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
-
Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
Clinician overrides of AI recommendations provide implicit preference signals for training clinical AI, harnessed via a new framework with a five-category override taxonomy, preferences conditioned on patient state and clinician capability, and dual reward-capability learning to prevent suppression bias.
-
Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics
The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49% (a minimal verification loop is sketched at the end of this page).
-
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though realignment reduces utility, revealing an asymmetry between attack and defense methods.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code
AIRA is a 15-check audit framework that finds AI-generated code has 1.8 times more high-severity failure and untruthfulness patterns than human-written code in a matched replication study.
-
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
-
The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
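The PlanGuard entry above turns on a planning-based consistency check: derive the set of permissible actions from the user instruction alone, then reject any agent action outside it. A minimal sketch of that general pattern follows; Action, plan_from_instruction, and verify are illustrative stand-ins (a real system would extract the plan with an LLM), not PlanGuard's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str      # e.g. "send_email"
    target: str    # resource or recipient the call touches

def plan_from_instruction(instruction: str) -> list[Action]:
    """Illustrative stand-in: derive allowed actions from the user
    instruction ONLY, before any untrusted tool output is seen; a real
    system would use an LLM here, this stub hand-writes the plan."""
    return [Action("search_web", "company homepage"),
            Action("send_email", "user")]

def verify(proposed: Action, plan: list[Action]) -> bool:
    """Reject any action absent from the instruction-derived plan.
    Instructions injected via tool outputs never update the plan,
    so the actions they request fail this check."""
    return proposed in plan

plan = plan_from_instruction("Find the company homepage and email it to me.")
assert verify(Action("send_email", "user"), plan)          # consistent: allowed
assert not verify(Action("send_email", "attacker"), plan)  # injected: blocked
```

On this view, the reported 0% attack success rate comes from injected instructions never entering the plan, and the 1.49% false-positive rate from legitimate actions the plan extractor fails to anticipate.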