Preference poisoning against log-linear DPO reduces to a binary sparse approximation problem solved by lattice-reduction (BAL-A) and matching-pursuit (BMP-A) algorithms that carry recovery guarantees.
org/abs/1703.06748
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
RoAd-RL is a new benchmarking library for adversarial reinforcement learning that evaluates DQN, PPO, and SAC agents across 192 attack-defense configurations and finds substantial robustness variations plus cases where defenses harm performance more than attacks.
TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on clean inputs.
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when a proxy reward model is optimized via RL or best-of-n, with coefficients that scale smoothly with reward model parameter count.
citing papers explorer
- RoAd-RL: A Unified Library and Benchmark for Robust Adversarial Reinforcement Learning