pith. sign in

arxiv: 2504.13276 · v3 · submitted 2025-04-17 · 📡 eess.SY · cs.SY

Planning Stealthy Backdoor Attacks in MDPs with Observation-Based Triggers

Pith reviewed 2026-05-22 18:42 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords backdoor attacksMarkov decision processesstochastic controladversarial attacksobservation-based triggersgradient optimizationconstrained Markov games
0
0 comments X

The pith

An attacker jointly optimizes a backdoor policy and observation-based trigger in Markov decision processes using only a black-box simulator to create stealthy attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that backdoor attacks can be planned in stochastic control systems by modeling them as a game between two players on an augmented state space. One player finds a policy that stays near-optimal on the original system but switches to harmful behavior when triggered. The second player designs a trigger that works from partial observations and small changes to transition probabilities. A switching gradient algorithm solves the joint problem. If successful, this shows that standard testing may miss attacks that only activate under specific hidden conditions.

Core claim

The paper claims that backdoor attack planning in MDPs can be cast as a constrained Markov game. Player 0 learns a backdoor policy that maximizes attack rewards when the trigger is active yet behaves near-optimally when the trigger is inactive. Player 1 designs a finite-memory trigger based on partial state observations. A switching gradient-based optimization algorithm jointly solves for both the policy and the trigger, and case-study experiments confirm that the resulting attacks remain stealthy while achieving high success rates.

What carries the argument

A constrained Markov game on an augmented state space in which one player optimizes the backdoor policy and the other designs the finite-memory, observation-based trigger.

If this is right

  • The backdoor policy passes testing because it remains near-optimal whenever the trigger is absent.
  • The trigger activates the attack only through finite memory of partial observations and small dynamic perturbations.
  • Attack performance changes predictably with parameters that control how stealthy the backdoor policy must appear.
  • The joint optimization produces attacks that succeed without requiring full state information at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar game formulations could be used by defenders to search for possible triggers in existing control policies.
  • The method suggests that simulation-based red-teaming should become standard practice for certifying safety-critical MDPs.
  • Extending the trigger design to continuous observation spaces would test whether the same switching-gradient approach remains tractable.

Load-bearing premise

The attacker has access to a black-box simulator of the original MDP and can introduce small perturbations to the transition dynamics while only observing partial states.

What would settle it

If experiments on the case-study MDP show that the learned policy either performs substantially worse than optimal in the untriggered original system or fails to activate the malicious behavior under the partial observations available at runtime, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2504.13276 by Ahmed H. Hemida, Charles A. Kamhoua, Jie Fu, Shuo Han, Xinyi Wei.

Figure 1
Figure 1. Figure 1: A 6 × 6 Stochastic Gridworld We consider a robot motion planning problem in a stochas￾tic Gridworld shown in [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The evaluations of policies over iterations: [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of attack performances under different parameters [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The evaluations of policies over iterations: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

This paper investigates backdoor attack planning in stochastic control systems modeled as Markov Decision Processes (MDPs). A backdoor attack involves an adversary deploying a policy that performs well in the original MDP to pass testing, but behaves maliciously at runtime when combined with a trigger that perturbs system dynamics. We consider a sophisticated attacker capable of jointly optimizing the backdoor policy and its trigger using only a blackbox simulator. During execution, the attacker has access only to partial observations of the system state and is restricted to introduce small perturbations to the system's transition dynamics. We formulate the attack planning problem as a constrained Markov game with an augmented state space and two players: Player 0 learns a backdoor policy that maximizes attack rewards when the trigger is active. However, when the trigger is inactive, the backdoor policy behaves near-optimally in the original MDP; Player 1 designs a finite-memory, observation-based trigger to activate the attack. We propose a switching gradient-based optimization algorithm to jointly solve for the backdoor policy and trigger. Experiments on a case study demonstrate the effectiveness of our method in achieving stealthy and successful backdoor attacks, and how the attack performance varies under different parameters related to the stealthiness of the backdoor attack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formulates backdoor attack planning in MDPs as a constrained Markov game on an augmented state space. Player 0 optimizes a backdoor policy that is near-optimal on the clean MDP but malicious when an observation-based trigger is active; Player 1 designs a finite-memory trigger. The attacker is assumed to have black-box simulator access and may apply only small perturbations to the transition kernel while observing partial states. A switching gradient-based algorithm is proposed to jointly solve the game, with effectiveness shown via a single case study.

Significance. If the central algorithmic claim holds, the work would be significant for security analysis of stochastic control systems, as it provides a concrete planning procedure for stealthy, observation-triggered backdoors under partial information. The constrained-game formulation and joint optimization of policy and trigger are technically interesting and could inform both attack and defense research in MDPs. However, the absence of convergence analysis or quantitative experimental detail currently limits the result's immediate impact.

major comments (2)
  1. [§4] §4 (Switching Gradient-Based Optimization): the manuscript presents the switching gradient algorithm for jointly solving the constrained Markov game but supplies neither a convergence proof nor an analysis of the non-stationarity induced by the switching rule or the augmented state. This is load-bearing for the central claim that the algorithm reliably produces a stealthy backdoor policy and trigger.
  2. [§5] §5 (Case Study): the abstract and summary state that experiments demonstrate effectiveness, yet no quantitative metrics, ablation studies on the switching rule or stealthiness bound, or comparisons to baselines are reported. Without these, the empirical support for the joint-optimization claim cannot be evaluated.
minor comments (1)
  1. [§3] The notation for the augmented state and the finite-memory trigger could be clarified with an explicit diagram or pseudocode in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address the major concerns point by point below, clarifying our position and outlining planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Switching Gradient-Based Optimization): the manuscript presents the switching gradient algorithm for jointly solving the constrained Markov game but supplies neither a convergence proof nor an analysis of the non-stationarity induced by the switching rule or the augmented state. This is load-bearing for the central claim that the algorithm reliably produces a stealthy backdoor policy and trigger.

    Authors: We acknowledge that the manuscript does not include a formal convergence proof for the switching gradient algorithm. The switching mechanism and augmented state do introduce non-stationarity, which precludes direct application of standard policy-gradient convergence results. In the revised version we will add a dedicated subsection that (i) explicitly discusses the sources of non-stationarity, (ii) provides a sketch of why a full proof is technically difficult under the current formulation, and (iii) reports empirical convergence diagnostics (e.g., value-function trajectories and trigger-activation statistics) from the case study to support practical reliability. We will also analyze the effect of the augmented state dimension on gradient variance. revision: partial

  2. Referee: [§5] §5 (Case Study): the abstract and summary state that experiments demonstrate effectiveness, yet no quantitative metrics, ablation studies on the switching rule or stealthiness bound, or comparisons to baselines are reported. Without these, the empirical support for the joint-optimization claim cannot be evaluated.

    Authors: We agree that the experimental section is currently limited and would benefit from additional quantitative detail. The single case study was intended as a proof-of-concept illustration rather than a comprehensive evaluation. In the revision we will expand §5 to report concrete metrics (attack success rate when the trigger is active, clean-MDP performance degradation as a stealthiness measure, and trigger activation frequency), include ablation studies on the switching interval and trigger memory length, and add comparisons against two baselines: a fixed (non-learned) trigger and a non-joint optimization approach that alternates between policy and trigger updates without the switching gradient rule. revision: yes

Circularity Check

0 steps flagged

No significant circularity; formulation and algorithm are independent of results

full rationale

The paper formulates the attack planning problem as a constrained Markov game with augmented state space, defines two players for the backdoor policy and observation-based trigger, and proposes a switching gradient-based optimization algorithm to jointly solve it. Effectiveness is shown via experiments on a single case study under varying stealthiness parameters. No load-bearing step reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation chain by construction; the optimization procedure and empirical demonstration remain external to the inputs rather than tautological. The derivation chain is self-contained against the stated MDP model and black-box simulator assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard MDP transition and reward assumptions plus the new modeling choice of an observation-based trigger; no additional free parameters or invented physical entities are introduced in the abstract.

free parameters (1)
  • perturbation magnitude bound
    Small perturbations to transition dynamics are allowed but their exact size is a tunable stealthiness parameter.
axioms (1)
  • domain assumption System dynamics can be modeled as an MDP with partial observations
    Invoked when the attacker is restricted to partial observations and small dynamics perturbations.

pith-pipeline@v0.9.0 · 5764 in / 1143 out tokens · 32901 ms · 2026-05-22T18:42:20.947202+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,

    P. Kiourti, K. Wardega, S. Jha, and W. Li, “TrojDRL: Evaluation of Backdoor Attacks on Deep Reinforcement Learning,” in2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6, July 2020. ISSN: 0738-100X

  2. [2]

    BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,

    C. Gong, Z. Yang, Y . Bai, J. He, J. Shi, K. Li, A. Sinha, B. Xu, X. Hou, D. Lo, and T. Wang, “BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets,” Mar. 2024. arXiv:2210.04688 [cs]

  3. [3]

    BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,

    J. Cui, Y . Han, Y . Ma, J. Jiao, and J. Zhang, “BadRL: Sparse Targeted Backdoor Attack against Reinforcement Learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 11687–11694, Mar. 2024

  4. [4]

    Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,

    M. Goldblum, D. Tsipras, C. Xie, X. Chen, A. Schwarzschild, D. Song, A. M ˛ adry, B. Li, and T. Goldstein, “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 1563–1580, Feb. 2023. Conference Name: IEEE Transactions on Pattern Analysis and Mach...

  5. [5]

    Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],

    S. Goldwasser, M. P. Kim, V . Vaikuntanathan, and O. Zamir, “Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract],” in2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 931–942, Oct. 2022. ISSN: 2575- 8454

  6. [6]

    Deep Learning Backdoors,

    S. Li, S. Ma, M. Xue, and B. Z. H. Zhao, “Deep Learning Backdoors,” inSecurity and Artificial Intelligence: A Crossdisciplinary Approach (L. Batina, T. Bäck, I. Buhan, and S. Picek, eds.), pp. 313–334, Cham: Springer International Publishing, 2022

  7. [7]

    Robust Control of Markov Decision Processes with Uncertain Transition Matrices,

    A. Nilim and L. El Ghaoui, “Robust Control of Markov Decision Processes with Uncertain Transition Matrices,”Operations Research, vol. 53, pp. 780–798, Oct. 2005. Publisher: INFORMS

  8. [8]

    Robust Markov Decision Processes,

    W. Wiesemann, D. Kuhn, and B. Rustem, “Robust Markov Decision Processes,”Mathematics of Operations Research, vol. 38, pp. 153– 183, Feb. 2013

  9. [9]

    Provable Defense against Backdoor Policies in Reinforcement Learning,

    S. Bharti, X. Zhang, A. Singla, and J. Zhu, “Provable Defense against Backdoor Policies in Reinforcement Learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 14704–14714, Dec. 2022

  10. [10]

    Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,

    M. Gao, Y . Zou, Z. Zhang, X. Cheng, and D. Yu, “Cooperative Back- door Attack in Decentralized Reinforcement Learning with Theoretical Guarantee,” May 2024. arXiv:2405.15245 [cs]

  11. [11]

    Solving POMDPs by Searching the Space of Finite Policies

    N. Meuleau, K.-E. Kim, L. P. Kaelbling, and A. R. Cassandra, “Solving pomdps by searching the space of finite policies,”arXiv preprint arXiv:1301.6720, 2013

  12. [12]

    Independent Learning in Con- strained Markov Potential Games,

    P. Jordan, A. Barakat, and N. He, “Independent Learning in Con- strained Markov Potential Games,” inProceedings of The 27th Inter- national Conference on Artificial Intelligence and Statistics, pp. 4024– 4032, PMLR, Apr. 2024. ISSN: 2640-3498

  13. [13]

    Stochastic first-order methods for convex and nonconvex functional constrained optimization,

    D. Boob, Q. Deng, and G. Lan, “Stochastic first-order methods for convex and nonconvex functional constrained optimization,”Mathe- matical Programming, vol. 197, pp. 215–279, Jan. 2023

  14. [14]

    First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,

    Z. Jia and B. Grimmer, “First-Order Methods for Nonsmooth Non- convex Functional Constrained Optimization with or without Slater Points,” Jan. 2025. arXiv:2212.00927 [math]