pith. machine review for the scientific record.

arxiv: 2604.10062 · v2 · submitted 2026-04-11 · 💻 cs.LG


When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs


Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords reward poisoning · linear MDPs · reinforcement learning · attackability · adversarial robustness · policy manipulation · necessary and sufficient conditions

The pith

Linear MDPs are vulnerable to reward poisoning exactly when the attack budget meets or exceeds their intrinsic robustness threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes the first exact conditions that decide whether reward poisoning with a limited budget can force a target policy in a linear Markov decision process. The conditions are both necessary and sufficient, so they identify precisely which linear MDPs can be attacked and which cannot. A sympathetic reader cares because the result separates environments that stay safe under ordinary reinforcement learning from those that require explicit defenses. The characterization also shows how to approximate the same distinction in deep RL settings by treating them as linear MDPs.

Core claim

The paper proves that a linear MDP admits a successful reward-poisoning attack with budget B if and only if B is at least as large as the minimum cost required to change the optimal policy; that minimum cost is fully determined by the MDP's linear feature representation, transition structure, and reward function.
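
In symbols (our paraphrase; the notation and the choice of norm are ours, not the paper's), the characterization is a single budget comparison:

```latex
% Our paraphrase of the claimed characterization, not the paper's statement.
% M: linear MDP with rewards r, \pi^\dagger: attacker's target policy, B: budget.
\[
  \operatorname{attackable}(M, \pi^{\dagger}, B)
  \iff
  B \ge C^{*}(M, \pi^{\dagger}),
  \qquad
  C^{*}(M, \pi^{\dagger})
  = \min_{\Delta r}
    \left\{ \lVert \Delta r \rVert \;:\;
      \pi^{\dagger} \text{ is optimal in } M \text{ under rewards } r + \Delta r
    \right\}.
\]
```

The exact norm and the notion of optimality (strict, or with a margin) are details the paper pins down; the point is that the threshold depends only on the MDP's features, transitions, and rewards, so it is an intrinsic quantity.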

What carries the argument

The intrinsic robustness measure of the linear MDP, equal to the smallest total reward perturbation that alters the optimal policy.
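
To make this measure concrete: in a tabular finite-horizon MDP (a linear MDP with one-hot features), the cheapest perturbation that forces a fixed target policy is a linear program, because the target policy's Q-values are linear in the rewards and one-step unimprovability certifies optimality. A minimal sketch, assuming an L1 cost and an eps optimality margin (both our choices; this is not the paper's algorithm):

```python
# Sketch: smallest L1 reward perturbation making a fixed target policy
# optimal in a tabular finite-horizon MDP (one-hot-feature linear MDP).
# Our construction for illustration, not the paper's algorithm.
import numpy as np
from scipy.optimize import linprog

def min_poisoning_cost(P, r, pi, eps=1e-3):
    """P: (H,S,A,S) transitions, r: (H,S,A) rewards, pi: (H,S) target
    policy (integer actions). Returns (cost, dr): dr is a perturbation
    making pi eps-optimal and cost = sum of |dr| over all (h, s, a)."""
    H, S, A = r.shape
    n = H * S * A
    idx = lambda h, s, a: (h * S + s) * A + a

    # The target policy's value is affine in the perturbation:
    # V^pi_h(s) = v_const[h, s] + w[h, s] @ dr_flat.
    w = np.zeros((H + 1, S, n))
    v_const = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            a = pi[h, s]
            w[h, s] = P[h, s, a] @ w[h + 1]
            w[h, s, idx(h, s, a)] += 1.0
            v_const[h, s] = r[h, s, a] + P[h, s, a] @ v_const[h + 1]

    # One-step unimprovability with a margin certifies optimality:
    # Q^pi_h(s, pi(s)) - Q^pi_h(s, a) >= eps for all a != pi(s).
    rows, rhs = [], []
    for h in range(H):
        for s in range(S):
            for a in range(A):
                if a == pi[h, s]:
                    continue
                lin_a = P[h, s, a] @ w[h + 1]   # fresh array, safe to edit
                lin_a[idx(h, s, a)] += 1.0
                const_a = r[h, s, a] + P[h, s, a] @ v_const[h + 1]
                rows.append(np.concatenate([lin_a - w[h, s], np.zeros(n)]))
                rhs.append(v_const[h, s] - const_a - eps)

    # Minimize the L1 cost via slack variables t >= |dr|; x = [dr, t].
    I = np.eye(n)
    A_ub = np.vstack([np.array(rows), np.hstack([I, -I]), np.hstack([-I, -I])])
    b_ub = np.concatenate([np.array(rhs), np.zeros(2 * n)])
    c = np.concatenate([np.zeros(n), np.ones(n)])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (2 * n))
    return res.fun, res.x[:n].reshape(H, S, A)
```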

If this is right

  • Below the threshold, even standard non-robust RL algorithms cannot be forced to adopt the attacker's target policy (a sketch of this budget test follows the list).
  • Above the threshold, an attacker can efficiently construct a poisoning strategy that succeeds.
  • The same threshold approximately predicts attackability when deep RL environments are treated as linear MDPs.
  • Robust linear MDPs remain safe under vanilla learning even when the attacker knows the full model.
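
Taken together, these bullets reduce to one decision rule: compute the threshold, compare the budget. Continuing the sketch above (the brute force over target policies is our shortcut and is only viable for tiny MDPs; the semantics are assumed from the abstract, not taken from the paper's definitions):

```python
# Continuing the sketch above. The intrinsic robustness threshold is the
# cheapest forcing cost over all policies that disagree with the optimum;
# attackability is then a single budget comparison.
import itertools
import numpy as np

def optimal_policy(P, r):
    """Greedy optimal policy by backward induction. P: (H,S,A,S), r: (H,S,A)."""
    H, S, A = r.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r[h] + P[h] @ V            # shape (S, A)
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

def robustness_threshold(P, r, eps=1e-3):
    """Brute force over all target policies; tiny MDPs only."""
    H, S, A = r.shape
    pi_star = optimal_policy(P, r)
    best = np.inf
    for flat in itertools.product(range(A), repeat=H * S):
        pi = np.asarray(flat, dtype=int).reshape(H, S)
        if np.array_equal(pi, pi_star):
            continue
        cost, _ = min_poisoning_cost(P, r, pi, eps)  # LP from the sketch above
        best = min(best, cost)
    return best

def is_attackable(P, r, budget):
    """Budget comparison: can poisoning force some non-optimal policy?"""
    return budget >= robustness_threshold(P, r)
```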

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Designers could adjust MDP features or transitions to push the robustness threshold higher and create inherently safe tasks.
  • In deployed systems the threshold could be computed first to decide whether extra defense layers are worth the cost.
  • The same budget-comparison logic may apply to other forms of reward or dynamics poisoning in reinforcement learning.
  • Controlled experiments on synthetic linear MDPs with known thresholds would directly test whether attack success matches the predicted boundary; a toy version of this check is sketched below.
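
A toy version of that last item, reusing the sketches above (our construction, not the paper's experiment): plant a target policy, solve the LP for its forcing cost, and check that planning on the poisoned rewards recovers the target.

```python
# Toy end-to-end check of the threshold logic, reusing min_poisoning_cost
# and optimal_policy from the sketches above (our construction).
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 2, 3, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # random transition kernel
r = rng.uniform(size=(H, S, A))                 # random rewards

pi_star = optimal_policy(P, r)
target = pi_star.copy()
target[0, 0] = 1 - target[0, 0]                 # flip one action: pi_dagger

cost, dr = min_poisoning_cost(P, r, target)
assert np.array_equal(optimal_policy(P, r + dr), target)  # certificate holds
print(f"minimum forcing cost for this target: {cost:.4f}")
```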

Load-bearing premise

The environment is exactly representable as a linear MDP whose features and transitions are known to the attacker.

What would settle it

A concrete linear MDP together with a budget B below the computed robustness threshold for which a poisoning attack nevertheless succeeds in changing the learned policy.

Figures

Figures reproduced from arXiv: 2604.10062 by Haoyang Hong, Haoyu Zhao, Huazheng Wang, Jiawei Li, Jose Efraim Aguilar Escamilla, Sanghyun Hong, Xuezhou Zhang.

Figure 1. Average policy difference (top row) and cumulative perturbations (bottom row) per episode for vulnerable (1a) or robust (1b) "Half-Cheetah" environments against LSVI-UCB.
Figure 2. Results from experiments (see section 6) with LSVI-UCB interacting with environments deemed intrinsically attackable according to equation 1. From left to right column-wise, we present the cumulative perturbation (left), policy difference (center), and perturbations per episode (right).
Figure 3. Results from experiments (see section 6) with LSVI-UCB interacting with environments deemed intrinsically robust according to equation 1. From left to right column-wise, we present the cumulative perturbation (left), policy difference (center), and perturbations per episode (right).
Figure 4. Extended results depicting all intrinsically attackable environments poisoned against CTRL as the victim algorithm. From left to right columns, we present the cumulative perturbations (left), the policy difference (center), and the perturbations per episode (right) with CTRL interacting with environments deemed intrinsically attackable according to equation 1.
Figure 5. Extended results depicting all intrinsically robust environments poisoned against CTRL as the victim algorithm. From left to right columns, we present the cumulative perturbations (left), the policy difference (center), and the perturbations per episode (right) with CTRL interacting with environments deemed intrinsically robust according to equation 1.
Original abstract

We study reward poisoning attacks in reinforcement learning (RL), where an adversary manipulates rewards within constrained budgets to force the target RL agent to adopt a policy that aligns with the attacker's objectives. Prior works on reward poisoning mainly focused on sufficient conditions to design a successful attacker, while only a few studies discussed the infeasibility of targeted attacks. This paper provides the first precise necessity and sufficiency characterization of the attackability of a linear MDP under reward poisoning attacks. Our characterization draws a bright line between the vulnerable RL instances, and the intrinsically robust ones which cannot be attacked without large costs even running vanilla non-robust RL algorithms. Our theory extends beyond linear MDPs -- by approximating deep RL environments as linear MDPs, we show that our theoretical framework effectively distinguishes the attackability and efficiently attacks the vulnerable ones, demonstrating both the theoretical and practical significance of our characterization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper studies reward poisoning attacks in reinforcement learning where an adversary manipulates rewards under a budget to force a target policy. It claims to deliver the first tight necessity-and-sufficiency characterization of attackability for linear MDPs (with linear rewards and transitions in a known feature map), cleanly separating vulnerable instances from intrinsically robust ones that resist attacks even under vanilla non-robust RL. The work further approximates deep RL environments as linear MDPs to demonstrate both attackability prediction and efficient attacks on vulnerable cases.

Significance. If the necessity and sufficiency conditions hold under the standard linear-MDP definition, the result would be significant: it supplies the first precise boundary between attackable and robust linear MDPs, directly informing when non-robust algorithms suffice and when defenses are required. The extension via linear approximation to deep RL adds practical relevance by showing how the theory can guide real-world attack and defense design.

minor comments (3)
  1. [Abstract] The claim of a 'tight characterization' and a 'bright line' would be strengthened by a one-sentence statement of the main necessary and sufficient condition (e.g., in terms of the feature map or the value-function gap) rather than leaving it entirely implicit.
  2. [Section 5 (or equivalent experimental section)] The linear-MDP approximation step for deep RL environments is described at a high level; adding a short paragraph on the approximation error, the choice of feature map, and any empirical validation of the approximation quality would improve clarity and reproducibility (a minimal fit-error check is sketched after this list).
  3. [Preliminaries / Notation] Ensure consistent use of the feature map φ and the budget parameter across all theorems; a small table collecting the key symbols and their meanings would aid readers.
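
On comment 2, one cheap diagnostic (hypothetical; the function and its inputs are our invention, not the paper's protocol) is to regress observed rewards on the chosen feature map and report the residual as a first-order measure of linear-MDP approximation quality:

```python
# Hypothetical diagnostic for linear-MDP approximation quality: regress
# observed rewards on the chosen feature map and report the residual RMSE.
# Names and setup are ours, not the paper's.
import numpy as np

def linear_reward_fit_error(Phi, rewards):
    """Phi: (N, d) rollout features phi(s, a); rewards: (N,) observed rewards.
    Returns RMSE of the best linear fit r(s, a) ~ phi(s, a)^T theta."""
    theta, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    residual = Phi @ theta - rewards
    return float(np.sqrt(np.mean(residual ** 2)))
```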

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our paper and the recommendation for minor revision. The referee's summary accurately captures the core contribution: the first tight necessity-and-sufficiency characterization of reward poisoning attackability in linear MDPs, along with the practical extension via linear approximation to deep RL environments. As no specific major comments were provided in the report, we have no points requiring rebuttal or revision at this time. We remain available to incorporate any minor changes the editor or referee may suggest.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives a necessity-and-sufficiency characterization of reward-poisoning attackability directly from the standard definition of linear MDPs (linear rewards and transitions over a known feature map). No fitted parameters, self-referential definitions, or load-bearing self-citations are invoked to force the separation between vulnerable and intrinsically robust instances; the conditions follow from the linear structure and attack budget constraints without reducing any claimed result to its own inputs by construction. The extension to deep RL is presented as an approximation-based application rather than a closed loop, and the abstract and claim description contain no equations or renamings that equate predictions to fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the result rests on standard linear MDP assumptions; no free parameters, invented entities, or ad-hoc axioms are mentioned.

axioms (1)
  • domain assumption: The environment is a linear MDP
    Central modeling choice that enables the necessity and sufficiency conditions.

pith-pipeline@v0.9.0 · 5469 in / 1103 out tokens · 39611 ms · 2026-05-10T16:30:40.492661+00:00 · methodology



    For each fixed(t, h), at most one target action is taken. Summing overh∈[H]andt≤T 1 proves the claim. Assume there exists an admissible predictable compensation schedule{c t h}t>T1 such that Ccorr(T1) := HX h=1 X t>T1 |ct h| ≤ HX h=1 |Dtar h |.(72) The actual Stage 2 reward is then ˆr(2) h,t(s, a) = ( rh(s, a) +c t h, a=π †(s), ¯rh(s, a), a∈ A cmp h (s). ...