Planning Stealthy Backdoor Attacks in MDPs with Observation-Based Triggers

Ahmed H. Hemida; Charles A. Kamhoua; Jie Fu; Shuo Han; Xinyi Wei

arxiv: 2504.13276 · v3 · submitted 2025-04-17 · 📡 eess.SY · cs.SY

Planning Stealthy Backdoor Attacks in MDPs with Observation-Based Triggers

Xinyi Wei , Shuo Han , Ahmed H. Hemida , Charles A. Kamhoua , Jie Fu This is my paper

Pith reviewed 2026-05-22 18:42 UTC · model grok-4.3

classification 📡 eess.SY cs.SY

keywords backdoor attacksMarkov decision processesstochastic controladversarial attacksobservation-based triggersgradient optimizationconstrained Markov games

0 comments

The pith

An attacker jointly optimizes a backdoor policy and observation-based trigger in Markov decision processes using only a black-box simulator to create stealthy attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that backdoor attacks can be planned in stochastic control systems by modeling them as a game between two players on an augmented state space. One player finds a policy that stays near-optimal on the original system but switches to harmful behavior when triggered. The second player designs a trigger that works from partial observations and small changes to transition probabilities. A switching gradient algorithm solves the joint problem. If successful, this shows that standard testing may miss attacks that only activate under specific hidden conditions.

Core claim

The paper claims that backdoor attack planning in MDPs can be cast as a constrained Markov game. Player 0 learns a backdoor policy that maximizes attack rewards when the trigger is active yet behaves near-optimally when the trigger is inactive. Player 1 designs a finite-memory trigger based on partial state observations. A switching gradient-based optimization algorithm jointly solves for both the policy and the trigger, and case-study experiments confirm that the resulting attacks remain stealthy while achieving high success rates.

What carries the argument

A constrained Markov game on an augmented state space in which one player optimizes the backdoor policy and the other designs the finite-memory, observation-based trigger.

If this is right

The backdoor policy passes testing because it remains near-optimal whenever the trigger is absent.
The trigger activates the attack only through finite memory of partial observations and small dynamic perturbations.
Attack performance changes predictably with parameters that control how stealthy the backdoor policy must appear.
The joint optimization produces attacks that succeed without requiring full state information at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar game formulations could be used by defenders to search for possible triggers in existing control policies.
The method suggests that simulation-based red-teaming should become standard practice for certifying safety-critical MDPs.
Extending the trigger design to continuous observation spaces would test whether the same switching-gradient approach remains tractable.

Load-bearing premise

The attacker has access to a black-box simulator of the original MDP and can introduce small perturbations to the transition dynamics while only observing partial states.

What would settle it

If experiments on the case-study MDP show that the learned policy either performs substantially worse than optimal in the untriggered original system or fails to activate the malicious behavior under the partial observations available at runtime, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2504.13276 by Ahmed H. Hemida, Charles A. Kamhoua, Jie Fu, Shuo Han, Xinyi Wei.

**Figure 1.** Figure 1: A 6 × 6 Stochastic Gridworld We consider a robot motion planning problem in a stochastic Gridworld shown in [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: The evaluations of policies over iterations: [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of attack performances under different parameters [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: The evaluations of policies over iterations: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

This paper investigates backdoor attack planning in stochastic control systems modeled as Markov Decision Processes (MDPs). A backdoor attack involves an adversary deploying a policy that performs well in the original MDP to pass testing, but behaves maliciously at runtime when combined with a trigger that perturbs system dynamics. We consider a sophisticated attacker capable of jointly optimizing the backdoor policy and its trigger using only a blackbox simulator. During execution, the attacker has access only to partial observations of the system state and is restricted to introduce small perturbations to the system's transition dynamics. We formulate the attack planning problem as a constrained Markov game with an augmented state space and two players: Player 0 learns a backdoor policy that maximizes attack rewards when the trigger is active. However, when the trigger is inactive, the backdoor policy behaves near-optimally in the original MDP; Player 1 designs a finite-memory, observation-based trigger to activate the attack. We propose a switching gradient-based optimization algorithm to jointly solve for the backdoor policy and trigger. Experiments on a case study demonstrate the effectiveness of our method in achieving stealthy and successful backdoor attacks, and how the attack performance varies under different parameters related to the stealthiness of the backdoor attack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper turns backdoor attack planning into a two-player game on an augmented MDP with a switching gradient solver, but the method has no convergence analysis and rests on one case study.

read the letter

The paper models backdoor attacks on MDPs as a constrained Markov game with an augmented state. One player finds a policy that stays near-optimal on the original system but becomes malicious once a trigger fires. The second player designs a finite-memory trigger that works from partial observations and only permits small perturbations to the transition dynamics. They solve the joint problem with a switching gradient algorithm and test it on a case study that varies the stealth bound. This setup is new in the way it combines observation-based triggers, partial observability, and the explicit game structure for stochastic controllers. The formulation is straightforward and makes the attacker's black-box simulator assumption explicit, which helps clarify what the adversary can actually do. The case study shows that attack success can be traded against stealthiness in a controlled way. The main gaps are in the algorithm itself. There is no convergence argument for the switching procedure, no handling of the non-stationarity that switching introduces, and no ablations on gradient variance or sensitivity to the perturbation bound. All claims rest on a single case study without comparisons to simpler baselines or checks on how the finite-memory trigger length affects results. Readers working on security of learned controllers or on robust RL will find the modeling choice useful as a starting point for thinking about defenses. The work is coherent on its own terms and shows clear engagement with the attack planning problem, so it is worth sending to referees even though the optimization claims need more support. I would recommend peer review with the expectation that reviewers will press for either theoretical backing or broader experiments.

Referee Report

2 major / 1 minor

Summary. The paper formulates backdoor attack planning in MDPs as a constrained Markov game on an augmented state space. Player 0 optimizes a backdoor policy that is near-optimal on the clean MDP but malicious when an observation-based trigger is active; Player 1 designs a finite-memory trigger. The attacker is assumed to have black-box simulator access and may apply only small perturbations to the transition kernel while observing partial states. A switching gradient-based algorithm is proposed to jointly solve the game, with effectiveness shown via a single case study.

Significance. If the central algorithmic claim holds, the work would be significant for security analysis of stochastic control systems, as it provides a concrete planning procedure for stealthy, observation-triggered backdoors under partial information. The constrained-game formulation and joint optimization of policy and trigger are technically interesting and could inform both attack and defense research in MDPs. However, the absence of convergence analysis or quantitative experimental detail currently limits the result's immediate impact.

major comments (2)

[§4] §4 (Switching Gradient-Based Optimization): the manuscript presents the switching gradient algorithm for jointly solving the constrained Markov game but supplies neither a convergence proof nor an analysis of the non-stationarity induced by the switching rule or the augmented state. This is load-bearing for the central claim that the algorithm reliably produces a stealthy backdoor policy and trigger.
[§5] §5 (Case Study): the abstract and summary state that experiments demonstrate effectiveness, yet no quantitative metrics, ablation studies on the switching rule or stealthiness bound, or comparisons to baselines are reported. Without these, the empirical support for the joint-optimization claim cannot be evaluated.

minor comments (1)

[§3] The notation for the augmented state and the finite-memory trigger could be clarified with an explicit diagram or pseudocode in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address the major concerns point by point below, clarifying our position and outlining planned revisions to improve the manuscript.

read point-by-point responses

Referee: [§4] §4 (Switching Gradient-Based Optimization): the manuscript presents the switching gradient algorithm for jointly solving the constrained Markov game but supplies neither a convergence proof nor an analysis of the non-stationarity induced by the switching rule or the augmented state. This is load-bearing for the central claim that the algorithm reliably produces a stealthy backdoor policy and trigger.

Authors: We acknowledge that the manuscript does not include a formal convergence proof for the switching gradient algorithm. The switching mechanism and augmented state do introduce non-stationarity, which precludes direct application of standard policy-gradient convergence results. In the revised version we will add a dedicated subsection that (i) explicitly discusses the sources of non-stationarity, (ii) provides a sketch of why a full proof is technically difficult under the current formulation, and (iii) reports empirical convergence diagnostics (e.g., value-function trajectories and trigger-activation statistics) from the case study to support practical reliability. We will also analyze the effect of the augmented state dimension on gradient variance. revision: partial
Referee: [§5] §5 (Case Study): the abstract and summary state that experiments demonstrate effectiveness, yet no quantitative metrics, ablation studies on the switching rule or stealthiness bound, or comparisons to baselines are reported. Without these, the empirical support for the joint-optimization claim cannot be evaluated.

Authors: We agree that the experimental section is currently limited and would benefit from additional quantitative detail. The single case study was intended as a proof-of-concept illustration rather than a comprehensive evaluation. In the revision we will expand §5 to report concrete metrics (attack success rate when the trigger is active, clean-MDP performance degradation as a stealthiness measure, and trigger activation frequency), include ablation studies on the switching interval and trigger memory length, and add comparisons against two baselines: a fixed (non-learned) trigger and a non-joint optimization approach that alternates between policy and trigger updates without the switching gradient rule. revision: yes

Circularity Check

0 steps flagged

No significant circularity; formulation and algorithm are independent of results

full rationale

The paper formulates the attack planning problem as a constrained Markov game with augmented state space, defines two players for the backdoor policy and observation-based trigger, and proposes a switching gradient-based optimization algorithm to jointly solve it. Effectiveness is shown via experiments on a single case study under varying stealthiness parameters. No load-bearing step reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain by construction; the optimization procedure and empirical demonstration remain external to the inputs rather than tautological. The derivation chain is self-contained against the stated MDP model and black-box simulator assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard MDP transition and reward assumptions plus the new modeling choice of an observation-based trigger; no additional free parameters or invented physical entities are introduced in the abstract.

free parameters (1)

perturbation magnitude bound
Small perturbations to transition dynamics are allowed but their exact size is a tunable stealthiness parameter.

axioms (1)

domain assumption System dynamics can be modeled as an MDP with partial observations
Invoked when the attacker is restricted to partial observations and small dynamics perturbations.

pith-pipeline@v0.9.0 · 5764 in / 1143 out tokens · 32901 ms · 2026-05-22T18:42:20.947202+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate the attack planning problem as a constrained Markov game with an augmented state space and two players... We propose a switching gradient-based optimization algorithm to jointly solve for the backdoor policy and trigger.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the stealthy backdoor strategy... V₀(M_π†) ≥ (1−ε)V*₀(M)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,

P. Kiourti, K. Wardega, S. Jha, and W. Li, “TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,” in2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, July 2020. ISSN: 0738-100X

work page 2020
[2]

BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, and T. Wang, “BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,” Mar. 2024. arXiv:2210.04688 [cs]

work page arXiv 2024
[3]

BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,

J. Cui, Y . Han, Y . Ma, J. Jiao, and J. Zhang, “BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 11687–11694, Mar. 2024

work page 2024
[4]

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,

M. Goldblum, D. Tsipras, C. Xie, X. Chen, A. Schwarzschild, D. Song, A. M ˛ adry, B. Li, and T. Goldstein, “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 1563–1580, Feb. 2023. Conference Name: IEEE Transactions on Pattern Analysis and Mach...

work page 2023
[5]

Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],

S. Goldwasser, M. P. Kim, V . Vaikuntanathan, and O. Zamir, “Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],” in2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 931–942, Oct. 2022. ISSN: 2575- 8454

work page 2022
[6]

Deep Learning Backdoors,

S. Li, S. Ma, M. Xue, and B. Z. H. Zhao, “Deep Learning Backdoors,” inSecurity and Artificial Intelligence: A Crossdisciplinary Approach (L. Batina, T. Bäck, I. Buhan, and S. Picek, eds.), pp. 313–334, Cham: Springer International Publishing, 2022

work page 2022
[7]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices,

A. Nilim and L. El Ghaoui, “Robust Control of Markov Decision Processes with Uncertain Transition Matrices,”Operations Research, vol. 53, pp. 780–798, Oct. 2005. Publisher: INFORMS

work page 2005
[8]

Robust Markov Decision Processes,

W. Wiesemann, D. Kuhn, and B. Rustem, “Robust Markov Decision Processes,”Mathematics of Operations Research, vol. 38, pp. 153– 183, Feb. 2013

work page 2013
[9]

Provable Defense against Backdoor Policies in Reinforcement Learning,

S. Bharti, X. Zhang, A. Singla, and J. Zhu, “Provable Defense against Backdoor Policies in Reinforcement Learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 14704–14714, Dec. 2022

work page 2022
[10]

Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,

M. Gao, Y . Zou, Z. Zhang, X. Cheng, and D. Yu, “Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,” May 2024. arXiv:2405.15245 [cs]

work page arXiv 2024
[11]

Solving POMDPs by Searching the Space of Finite Policies

N. Meuleau, K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra, “Solving pomdps by searching the space of finite policies,”arXiv preprint arXiv:1301.6720, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[12]

Independent Learning in Con- strained Markov Potential Games,

P. Jordan, A. Barakat, and N. He, “Independent Learning in Con- strained Markov Potential Games,” inProceedings of The 27th Inter- national Conference on Artificial Intelligence and Statistics, pp. 4024– 4032, PMLR, Apr. 2024. ISSN: 2640-3498

work page 2024
[13]

Stochastic first-order methods for convex and nonconvex functional constrained optimization,

D. Boob, Q. Deng, and G. Lan, “Stochastic first-order methods for convex and nonconvex functional constrained optimization,”Mathe- matical Programming, vol. 197, pp. 215–279, Jan. 2023

work page 2023
[14]

First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,

Z. Jia and B. Grimmer, “First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,” Jan. 2025. arXiv:2212.00927 [math]

work page arXiv 2025

[1] [1]

TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,

P. Kiourti, K. Wardega, S. Jha, and W. Li, “TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,” in2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, July 2020. ISSN: 0738-100X

work page 2020

[2] [2]

BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,

C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, and T. Wang, “BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,” Mar. 2024. arXiv:2210.04688 [cs]

work page arXiv 2024

[3] [3]

BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,

J. Cui, Y . Han, Y . Ma, J. Jiao, and J. Zhang, “BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 11687–11694, Mar. 2024

work page 2024

[4] [4]

Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,

M. Goldblum, D. Tsipras, C. Xie, X. Chen, A. Schwarzschild, D. Song, A. M ˛ adry, B. Li, and T. Goldstein, “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 1563–1580, Feb. 2023. Conference Name: IEEE Transactions on Pattern Analysis and Mach...

work page 2023

[5] [5]

Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],

S. Goldwasser, M. P. Kim, V . Vaikuntanathan, and O. Zamir, “Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],” in2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 931–942, Oct. 2022. ISSN: 2575- 8454

work page 2022

[6] [6]

Deep Learning Backdoors,

S. Li, S. Ma, M. Xue, and B. Z. H. Zhao, “Deep Learning Backdoors,” inSecurity and Artificial Intelligence: A Crossdisciplinary Approach (L. Batina, T. Bäck, I. Buhan, and S. Picek, eds.), pp. 313–334, Cham: Springer International Publishing, 2022

work page 2022

[7] [7]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices,

A. Nilim and L. El Ghaoui, “Robust Control of Markov Decision Processes with Uncertain Transition Matrices,”Operations Research, vol. 53, pp. 780–798, Oct. 2005. Publisher: INFORMS

work page 2005

[8] [8]

Robust Markov Decision Processes,

W. Wiesemann, D. Kuhn, and B. Rustem, “Robust Markov Decision Processes,”Mathematics of Operations Research, vol. 38, pp. 153– 183, Feb. 2013

work page 2013

[9] [9]

Provable Defense against Backdoor Policies in Reinforcement Learning,

S. Bharti, X. Zhang, A. Singla, and J. Zhu, “Provable Defense against Backdoor Policies in Reinforcement Learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 14704–14714, Dec. 2022

work page 2022

[10] [10]

Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,

M. Gao, Y . Zou, Z. Zhang, X. Cheng, and D. Yu, “Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,” May 2024. arXiv:2405.15245 [cs]

work page arXiv 2024

[11] [11]

Solving POMDPs by Searching the Space of Finite Policies

N. Meuleau, K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra, “Solving pomdps by searching the space of finite policies,”arXiv preprint arXiv:1301.6720, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[12] [12]

Independent Learning in Con- strained Markov Potential Games,

P. Jordan, A. Barakat, and N. He, “Independent Learning in Con- strained Markov Potential Games,” inProceedings of The 27th Inter- national Conference on Artificial Intelligence and Statistics, pp. 4024– 4032, PMLR, Apr. 2024. ISSN: 2640-3498

work page 2024

[13] [13]

Stochastic first-order methods for convex and nonconvex functional constrained optimization,

D. Boob, Q. Deng, and G. Lan, “Stochastic first-order methods for convex and nonconvex functional constrained optimization,”Mathe- matical Programming, vol. 197, pp. 215–279, Jan. 2023

work page 2023

[14] [14]

First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,

Z. Jia and B. Grimmer, “First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,” Jan. 2025. arXiv:2212.00927 [math]

work page arXiv 2025