pith. machine review for the scientific record.

arxiv: 2605.05977 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning


Pith reviewed 2026-05-08 10:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords backdoor defense · deep reinforcement learning · online detection · action distribution · behavioral drift · single-agent DRL · multi-agent DRL · trigger-agnostic

The pith

Backdoored DRL policies leave detectable shifts in action distributions even without triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that backdoor attacks on deep reinforcement learning agents force consistent changes in how often each action is chosen, visible in the upper quantiles and tails of the action distribution. These shifts occur to make the backdoor reliable and persist even when the trigger is not present. BehaviorGuard turns this observation into a runtime metric that flags and blocks suspicious actions on the fly. The approach works for both single-agent and multi-agent settings and avoids the need to recover triggers or retrain models, which existing defenses require.
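
To make the runtime mechanism concrete, here is a minimal self-contained sketch of an online flag-and-block loop of this kind. Everything in it is an assumption for illustration: the toy policy, the window size, the monitored quantile, the threshold, and the runner-up fallback are all invented here, since the abstract does not specify BehaviorGuard's actual metric or suppression rule.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS, WINDOW, THRESHOLD = 4, 128, 0.1
CLEAN_Q95 = 0.80  # 95th-percentile confidence from a clean run (assumed)

def fake_policy(step: int) -> np.ndarray:
    """Stand-in policy whose confidence tail inflates late in the run,
    mimicking the residual shift a backdoor leaves even without triggers."""
    logits = rng.normal(size=N_ACTIONS)
    if step > 500:
        logits[0] += 4.0  # over-confident preference for one action
    p = np.exp(logits)
    return p / p.sum()

recent = deque(maxlen=WINDOW)  # sliding window of greedy-action confidences
flagged = 0
for step in range(1000):
    probs = fake_policy(step)
    recent.append(float(probs.max()))
    action = int(probs.argmax())
    if len(recent) == WINDOW:
        drift = abs(np.quantile(list(recent), 0.95) - CLEAN_Q95)
        if drift > THRESHOLD:  # suspicious tail shift: block and fall back
            action = int(np.argsort(probs)[-2])  # runner-up action instead
            flagged += 1
    # a real deployment would call env.step(action) here
print(f"steps flagged: {flagged} / 1000")
```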

Core claim

Regardless of attack type, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails even in the absence of triggers; this property enables an online metric to identify and suppress backdoor actions at runtime.

What carries the argument

A behavioral-drift metric that measures deviations in action-distribution tails and high quantiles to detect and suppress backdoor actions during execution.
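
As a hedged illustration of what such a metric might measure, the sketch below computes a simple high-quantile gap on synthetic confidence scores. It is not the paper's metric (the formula is not given in the material above); it only shows why tail mass separates a clean policy from one whose confidence distribution a backdoor has inflated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-step max action probabilities from a hypothetical clean
# policy, and from a hypothetical backdoored one whose confidence tail is
# inflated because the backdoor must activate reliably when triggered.
clean = rng.beta(5, 5, size=5000)
backdoored = np.concatenate([rng.beta(5, 5, size=4250),
                             rng.beta(50, 1, size=750)])  # near-1 tail mass

def tail_drift(ref: np.ndarray, obs: np.ndarray,
               quantiles=(0.90, 0.95, 0.99)) -> float:
    """Mean absolute gap between reference and observed high quantiles."""
    return float(np.mean([abs(np.quantile(obs, q) - np.quantile(ref, q))
                          for q in quantiles]))

print(tail_drift(clean, clean[:2500]))  # near zero: same distribution
print(tail_drift(clean, backdoored))    # clearly larger: tail shift visible
```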

If this is right

  • Defense operates online without reconstructing triggers or performing costly fine-tuning.
  • The same metric applies to both single-agent and multi-agent reinforcement learning.
  • Detection relies on behavioral output rather than reward anomalies or model internals.
  • Computational cost stays low because no offline analysis or model update is needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distribution-shift idea could be tested in other sequential decision systems such as robotic control policies.
  • Adaptive attackers might try to minimize tail shifts, so the metric may require environment-specific thresholds.
  • Similar tail-monitoring techniques could be explored for backdoor detection in language-model agents or planning systems.

Load-bearing premise

Backdoored policies must always create consistent, detectable shifts in action distributions no matter which attack method is used.

What would settle it

Train a DRL policy with a backdoor attack that produces no measurable change in high-quantile or tail behavior of the action distribution compared with the clean policy when the trigger is absent.
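
A sketch of how that settling experiment could be scored, under stated assumptions (synthetic confidence scores stand in for trigger-free rollouts; SciPy is available). If, for a real conditional attack, the bootstrap interval for the high-quantile gap covers zero and the upper tails are statistically indistinguishable, the load-bearing premise fails for that attack.

```python
import numpy as np
from scipy import stats  # assumption: SciPy is available

rng = np.random.default_rng(2)
# Stand-ins for trigger-free rollout confidences; a maximally stealthy
# attack would make `candidate` match `clean` exactly, as simulated here.
clean = rng.beta(5, 5, size=3000)
candidate = rng.beta(5, 5, size=3000)

# Bootstrap the 95th-percentile gap between the two runs.
gaps = [np.quantile(rng.choice(candidate, size=3000), 0.95)
        - np.quantile(rng.choice(clean, size=3000), 0.95)
        for _ in range(1000)]
lo, hi = np.quantile(gaps, [0.025, 0.975])

# Two-sample KS test restricted to the upper tails.
ks = stats.ks_2samp(clean[clean > np.quantile(clean, 0.9)],
                    candidate[candidate > np.quantile(candidate, 0.9)])
print(f"95th-quantile gap CI: [{lo:.3f}, {hi:.3f}]; "
      f"upper-tail KS p = {ks.pvalue:.3f}")
```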

Figures

Figures reproduced from arXiv: 2605.05977 by Chunwei Tian, Daoqiang Zhang, Jiadai Wang, Qi Zhu, Sai Xu, Xueyu Yin, Yinbo Yu.

Figure 1: (a) A 3×3 pixel patch in the upper left corner; (b) A se…
Figure 2: Behavioral drift comparison of clean and backdoor poli…
Figure 3: ROC curves of BehaviorGuard’s backdoor detection in (a)…
Figure 4: Ablation studies of BehaviorGuard. (a) Defense perfor…
Original abstract

Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks both in single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BehaviorGuard, an online, trigger-agnostic defense against backdoor attacks in both single-agent and multi-agent deep reinforcement learning. It rests on the empirical observation that backdoored policies produce consistent shifts in action distributions (detectable in high-quantile regions and tails) even in the absence of triggers; a novel metric is introduced to quantify this behavioral drift and suppress anomalous actions at runtime. The work claims to be the first such online defense and reports superior efficacy and efficiency over prior methods across diverse benchmarks and attack types.

Significance. If the core distributional-shift property holds for the full range of attack strategies, the result would be significant: it supplies a practical, low-overhead runtime defense that avoids trigger reverse-engineering and costly fine-tuning. Extension to multi-agent DRL is a useful broadening. The approach could enable deployable protection in safety-critical DRL systems where triggers are unknown or complex.

major comments (2)
  1. [Abstract / §3] Abstract and the statement of the key finding (presumably §3): the claim that 'regardless of attacks, backdoored policies induce consistent shifts in action distributions ... leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers' is load-bearing for the entire online, trigger-agnostic method. This property must be shown to survive stealthy attacks that replicate the clean policy's action distribution on all but a small set of trigger states; otherwise the high-quantile metric produces no signal and the defense fails. The manuscript should either prove the property or add experiments with such conditional attacks.
  2. [§4 / Tables] Evaluation section (presumably §4 and associated tables): the reported outperformance is presented without the explicit formula for the novel drift metric, the precise quantile thresholds used, or statistical error bars on detection rates. Without these, it is impossible to verify that the gains are robust rather than tuned to the chosen attack suite.
minor comments (2)
  1. [Method] The metric definition should be given as a formal equation (with all parameters and the exact quantile or tail measure) to support reproducibility; one illustrative shape is sketched after this list.
  2. [Method] Clarify how the multi-agent case extends the single-agent metric (e.g., joint vs. per-agent action distributions) and whether any additional assumptions are introduced.
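
For orientation, one plausible shape such a formal definition could take is sketched below. This is a hypothetical construction, not the paper's equation: the quantile band, the reference distribution, and the threshold are all invented for illustration.

```latex
% Hypothetical quantile-band drift metric (not the paper's definition).
% \hat{F}^{-1}_t : empirical quantile function of recent action-confidence
%                  scores at runtime step t
% F^{-1}_clean   : quantile function of a clean reference policy
% q_0            : lower edge of the monitored high-quantile band
% \tau           : flagging threshold
\[
  D_t \;=\; \frac{1}{1 - q_0} \int_{q_0}^{1}
    \bigl| \hat{F}^{-1}_t(q) - F^{-1}_{\mathrm{clean}}(q) \bigr| \, dq ,
  \qquad \text{flag step } t \text{ iff } D_t > \tau .
\]
```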

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review. The comments identify important aspects of our core claim and evaluation that require clarification and strengthening. We respond to each major comment below.

point-by-point responses
  1. Referee: [Abstract / §3] Abstract and the statement of the key finding (presumably §3): the claim that 'regardless of attacks, backdoored policies induce consistent shifts in action distributions ... leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers' is load-bearing for the entire online, trigger-agnostic method. This property must be shown to survive stealthy attacks that replicate the clean policy's action distribution on all but a small set of trigger states; otherwise the high-quantile metric produces no signal and the defense fails. The manuscript should either prove the property or add experiments with such conditional attacks.

    Authors: We appreciate the referee pointing out the load-bearing nature of this empirical observation. Our experiments across diverse single- and multi-agent benchmarks and attack types (including poisoning and trigger-based methods) consistently show detectable shifts in high-quantile action regions even without triggers, as backdoors must reliably override behavior on activation while preserving overall policy stability. We acknowledge that the manuscript does not include a theoretical proof that this holds for every conceivable attack, nor experiments with highly conditional stealthy attacks that match the clean distribution except on rare trigger states. Such attacks could indeed reduce the signal available to the drift metric. In revision we will add a dedicated limitations paragraph discussing this boundary case, note that effective stealthy attacks may still require non-trivial distributional adjustments to achieve reliable activation, and include new experiments with conditional trigger attacks where implementation is feasible. revision: partial

  2. Referee: [§4 / Tables] Evaluation section (presumably §4 and associated tables): the reported outperformance is presented without the explicit formula for the novel drift metric, the precise quantile thresholds used, or statistical error bars on detection rates. Without these, it is impossible to verify that the gains are robust rather than tuned to the chosen attack suite.

    Authors: We agree that these details are necessary for reproducibility and independent verification. The revised manuscript will explicitly state the mathematical formula for the behavioral drift metric, list the exact quantile thresholds (and any sensitivity analysis) used in all experiments, and add error bars (standard deviation across random seeds and runs) to the detection-rate tables. These additions will make the robustness of the reported gains clearer. revision: yes
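
A minimal sketch of the promised reporting style, with synthetic placeholder numbers (the real per-seed detection rates would come from the revised experiments):

```python
import numpy as np

# Detection rate per random seed; values below are made-up placeholders.
per_seed = np.array([0.91, 0.94, 0.89, 0.93, 0.92])
print(f"detection rate: {per_seed.mean():.3f} ± {per_seed.std(ddof=1):.3f} "
      "(mean ± sample std over 5 seeds)")
```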

standing simulated objections (unresolved)
  • A general theoretical proof that behavioral drift must occur for every possible backdoor attack strategy, including arbitrarily stealthy conditional attacks.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents the key premise as an empirical finding ('we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions... even in the absence of triggers') and then designs a metric to capture the observed drift. No equations or steps reduce the claimed result to its own inputs by construction, no parameters are fitted on a subset and relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The chain is an observation-to-metric pipeline that remains independent of the target defense performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach rests on an empirical observation of action distribution shifts without detailing how the metric is constructed or normalized.

pith-pipeline@v0.9.0 · 5493 in / 993 out tokens · 42918 ms · 2026-05-08T10:40:56.043005+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 8 canonical work pages · 3 internal anchors

  1. [Acharya et al., 2023] Manoj Acharya, Weichao Zhou, Anirban Roy, Xiao Lin, Wenchao Li, and Susmit Jha. Universal trojan signatures in reinforcement learning. In NeurIPS 2023 Workshop on Backdoors in Deep Learning - The Good, the Bad, and the Ugly, 2023.
  2. [Alshiekh et al., 2018] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In AAAI, volume 32, 2018.
  3. [Bansal et al., 2017] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017.
  4. [Bellemare et al., 2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  5. [Bharti et al., 2022] Shubham Bharti, Xuezhou Zhang, Adish Singla, and Jerry Zhu. Provable defense against backdoor policies in reinforcement learning. NeurIPS, 35:14704–14714, 2022.
  6. [Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  7. [Busoniu et al., 2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
  8. [Chen et al., 2022] Yanjiao Chen, Zhicong Zheng, and Xueluan Gong. MARNet: Backdoor attacks against cooperative multi-agent reinforcement learning. IEEE TDSC, 20(5):4188–4198, 2022.
  9. [Chen et al., 2023] Xuan Chen, Wenbo Guo, Guanhong Tao, Xiangyu Zhang, and Dawn Song. BIRD: Generalizable backdoor detection and removal for deep reinforcement learning. NeurIPS, 36:40786–40798, 2023.
  10. [Cui et al., 2024] Jing Cui, Yufei Han, Yuzhe Ma, Jianbin Jiao, and Junge Zhang. BadRL: Sparse targeted backdoor attack against reinforcement learning. In AAAI, volume 38, pages 11687–11694, 2024.
  11. [Fang et al., 2025] Jing Fang, Saihao Yan, Xueyu Yin, Yinbo Yu, Chunwei Tian, and Jiajia Liu. BLAST: A stealthy backdoor leverage attack against cooperative multi-agent deep reinforcement learning based systems. arXiv preprint arXiv:2501.01593, 2025.
  12. [Foerster et al., 2018] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, volume 32, 2018.
  13. [Gao et al., 2019] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. STRIP: A defence against trojan attacks on deep neural networks. In ACSAC, pages 113–125, 2019.
  14. [Gong et al., 2024] Chen Gong, Zhou Yang, Yunpeng Bai, Junda He, Jieke Shi, Kecen Li, Arunesh Sinha, Bowen Xu, Xinwen Hou, David Lo, et al. BAFFLE: Hiding backdoors in offline reinforcement learning datasets. In IEEE S&P, pages 2086–2104, 2024.
  15. [Gu et al., 2017] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
  16. [Gu et al., 2024] Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theories and applications. IEEE TPAMI, 2024.
  17. [Guo et al., 2023] Junfeng Guo, Ang Li, Lixu Wang, and Cong Liu. PolicyCleanse: Backdoor detection and mitigation for competitive reinforcement learning. In ICCV, pages 4699–4708, 2023.
  18. [Kiourti et al., 2019] Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. TrojDRL: Trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638, 2019.
  19. [Liu et al., 2018] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS. Internet Society, 2018.
  20. [Liu et al., 2025] Shijie Liu, Andrew C Cullen, Paul Montague, Sarah Erfani, and Benjamin IP Rubinstein. Fox in the henhouse: Supply-chain backdoor attacks against reinforcement learning. arXiv preprint arXiv:2505.19532, 2025.
  21. [Lowe et al., 2017] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. NeurIPS, 30, 2017.
  22. [Rashid et al., 2018] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, volume 80, pages 4292–4301, 2018.
  23. [Rashid et al., 2020] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020.
  24. [Rathbun et al., 2025] Ethan Rathbun, Alina Oprea, and Christopher Amato. Adversarial inception backdoor attacks against reinforcement learning. In ICML, 2025.
  25. [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  26. [Scott, 2015] David W Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
  27. [Tang et al., 2025] Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems, 8(1):153–188, 2025.
  28. [Vinyals et al., 2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  29. [Vyas et al., 2024] Sanyam Vyas, Chris Hicks, and Vasilios Mavroudis. Mitigating deep reinforcement learning backdoors in the neural activation space. In 2024 IEEE Security and Privacy Workshops (SPW), pages 76–86. IEEE, 2024.
  30. [Wang et al., 2019] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE S&P, pages 707–723. IEEE, 2019.
  31. [Wang et al., 2021a] Lun Wang, Zaynah Javed, Xian Wu, Wenbo Guo, Xinyu Xing, and Dawn Song. BACKDOORL: Backdoor attack against competitive reinforcement learning. arXiv preprint arXiv:2105.00579, 2021.
  32. [Wang et al., 2022] Zhenting Wang, Kai Mei, Hailun Ding, Juan Zhai, and Shiqing Ma. Rethinking the reverse-engineering of trojan triggers. NeurIPS, 35:9738–9753, 2022.
  33. [Yu et al., 2022] Yinbo Yu, Jiajia Liu, Shouqing Li, Kepu Huang, and Xudong Feng. A temporal-pattern backdoor attack to deep reinforcement learning. In IEEE GLOBECOM, pages 2710–2715, 2022.
  34. [Yu et al., 2024] Yinbo Yu, Jiajia Liu, Hongzhi Guo, Bomin Mao, and Nei Kato. A spatiotemporal backdoor attack against behavior-oriented decision makers in metaverse: From perspective of autonomous driving. IEEE JSAC, 42(4):948–962, 2024.
  35. [Yuan et al., 2024] Zhuowen Yuan, Wenbo Guo, Jinyuan Jia, Bo Li, and Dawn Song. SHINE: Shielding backdoors in deep reinforcement learning. In ICML, 2024.
  36. [Zheng et al., 2023] Haibin Zheng, Xiaohao Li, Jinyin Chen, Jianfeng Dong, Yan Zhang, and Changting Lin. One4All: Manipulate one agent to poison the cooperative multi-agent reinforcement learning. Computers & Security, 124:103005, 2023.