Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
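The reduction is easiest to see as a small procedure. Below is a minimal sketch, assuming a log-linear (linear-feature) DPO model in which each preference pair contributes a feature-difference vector and flipping its label negates that contribution; the greedy loop is a generic binary matching-pursuit heuristic, and every name in it (greedy_label_flips, pair_diffs, target) is an illustrative stand-in, not the paper's algorithm or interface.

```python
import numpy as np

def greedy_label_flips(pair_diffs, target, budget):
    """Hedged sketch: label-flip poisoning as binary matching pursuit.

    pair_diffs: (n, d) array; row i is phi(x_i, y_w) - phi(x_i, y_l),
        the feature difference of preference pair i under a log-linear
        DPO model.
    target: (d,) direction the attacker wants the aggregate preference
        signal to move toward.
    budget: maximum number of labels the attacker may flip.
    """
    n, _ = pair_diffs.shape
    signs = np.ones(n)                 # +1 = original label, -1 = flipped
    residual = np.asarray(target, dtype=float) - pair_diffs.sum(axis=0)
    flipped = []
    for _ in range(budget):
        # Flipping pair i swaps (y_w, y_l), negating its contribution,
        # which changes the residual by +2 * signs[i] * pair_diffs[i].
        deltas = 2.0 * signs[:, None] * pair_diffs        # (n, d)
        norms = np.linalg.norm(residual[None, :] + deltas, axis=1)
        norms[flipped] = np.inf                           # one flip per pair
        i = int(np.argmin(norms))
        if norms[i] >= np.linalg.norm(residual):
            break                                         # no improving flip
        residual += deltas[i]
        signs[i] = -signs[i]
        flipped.append(i)
    return flipped
```

Recovery guarantees of the kind the headline claims would hinge on conditions on the pair-difference matrix (e.g. incoherence); the loop above only makes the binary-sparse-approximation framing concrete.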
citation dossier
arXiv preprint arXiv:2307.15217, 2023.
why this work matters in Pith
Pith has found this work cited in 19 reviewed papers. Its strongest current cluster is cs.LG (9 papers), and the largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution (the weighted Nash objective is sketched after this list).
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models (a Lagrangian formulation is sketched after this list).
Alignment of vision-language models with human V1-V3 early visual cortex predicts resistance to sycophantic gaslighting attacks.
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
Clinician overrides of AI recommendations provide implicit preference signals for training clinical AI, harnessed via a new framework with a five-category override taxonomy, preferences conditioned on patient state and clinician capability, and dual reward-capability learning to prevent suppression bias.
The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though realignment reduces utility, revealing an asymmetry between attack and defense methods.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
AIRA is a 15-check audit framework that finds AI-generated code has 1.8 times more high-severity failure and untruthfulness patterns than human-written code in a matched replication study.
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI (the Rao-Blackwell identity is sketched below).
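For the generative-retraining entry above, the weighted Nash bargaining solution has a standard form worth stating: the limiting distribution maximizes a weight-exponentiated product of group utilities. The notation (utilities U_i, weights w_i, disagreement points d_i) is assumed here for illustration; the paper's exact objective may differ.

```latex
% Weighted Nash bargaining over candidate distributions p:
p^{\star} \in \arg\max_{p} \prod_{i} \bigl(U_i(p) - d_i\bigr)^{w_i}
\;\Longleftrightarrow\;
p^{\star} \in \arg\max_{p} \sum_{i} w_i \log\bigl(U_i(p) - d_i\bigr).
```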
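For the safe-RLHF entry, the generic primal-dual scheme for a discounted CMDP alternates policy-gradient ascent on a Lagrangian with projected descent on the multiplier. Notation (reward return J_r, cost return J_c, budget b, step sizes eta) is assumed; this is the textbook template rather than the paper's specific update.

```latex
L(\theta, \lambda) = J_r(\theta) - \lambda \bigl(J_c(\theta) - b\bigr), \qquad
\theta_{t+1} = \theta_t + \eta_{\theta} \nabla_{\theta} L(\theta_t, \lambda_t), \qquad
\lambda_{t+1} = \bigl[\lambda_t + \eta_{\lambda} \bigl(J_c(\theta_t) - b\bigr)\bigr]_{+}.
```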
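And for the Blackwell entry, the Rao-Blackwell identity underlying the variance-reduction claim: conditioning any estimator on a sufficient statistic never increases variance, by the law of total variance. Textbook form, stated only to make the claim concrete.

```latex
\tilde{\theta} = \mathbb{E}\bigl[\hat{\theta} \mid T\bigr], \qquad
\operatorname{Var}(\hat{\theta})
  = \mathbb{E}\bigl[\operatorname{Var}(\hat{\theta} \mid T)\bigr]
  + \operatorname{Var}\bigl(\mathbb{E}[\hat{\theta} \mid T]\bigr)
  \;\ge\; \operatorname{Var}(\tilde{\theta}).
```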
citing papers explorer
-
Efficient Preference Poisoning Attack on Offline RLHF
Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex predicts resistance to sycophantic gaslighting attacks.
-
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
-
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
-
Can Revealed Preferences Clarify LLM Alignment and Steering?
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
-
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
-
Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care
Clinician overrides of AI recommendations provide implicit preference signals for training clinical AI, harnessed via a new framework with a five-category override taxonomy, preferences conditioned on patient state and clinician capability, and dual reward-capability learning to prevent suppression bias.
-
Post-AGI Economies: Autonomy and the First Fundamental Theorem of Welfare Economics
The First Fundamental Theorem of Welfare Economics holds for autonomy-complete competitive equilibria that are autonomy-Pareto efficient, with the classical version recovered in the low-autonomy limit.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49% (a minimal verification loop is sketched at the end of this page).
-
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though realignment reduces utility, revealing an asymmetry between attack and defense methods.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code
AIRA is a 15-check audit framework that finds AI-generated code has 1.8 times more high-severity failure and untruthfulness patterns than human-written code in a matched replication study.
-
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
-
The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence
Blackwell's Rao-Blackwell, Approachability, and Informativeness theorems provide frameworks for variance reduction, sequential decisions under uncertainty, and comparing information sources that remain relevant to AI.
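The PlanGuard entry above turns on a planning-based consistency check: derive the set of permissible actions from the user instruction alone, then reject any agent action outside it. A minimal sketch of that general pattern follows; Action, plan_from_instruction, and verify are illustrative stand-ins (a real system would extract the plan with an LLM), not PlanGuard's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str      # e.g. "send_email"
    target: str    # resource or recipient the call touches

def plan_from_instruction(instruction: str) -> list[Action]:
    """Illustrative stand-in: derive allowed actions from the user
    instruction ONLY, before any untrusted tool output is seen; a real
    system would use an LLM here, this stub hand-writes the plan."""
    return [Action("search_web", "company homepage"),
            Action("send_email", "user")]

def verify(proposed: Action, plan: list[Action]) -> bool:
    """Reject any action absent from the instruction-derived plan.
    Instructions injected via tool outputs never update the plan,
    so the actions they request fail this check."""
    return proposed in plan

plan = plan_from_instruction("Find the company homepage and email it to me.")
assert verify(Action("send_email", "user"), plan)          # consistent: allowed
assert not verify(Action("send_email", "attacker"), plan)  # injected: blocked
```

On this view, the reported 0% attack success rate comes from injected instructions never entering the plan, and the 1.49% false-positive rate from legitimate actions the plan extractor fails to anticipate.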