pith. machine review for the scientific record.

arxiv: 2605.01950 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.AI

Recognition: unknown

TRAP: Tail-aware Ranking Attack for World-Model Planning

Ke Zhang, Siyuan Duan, Xizhao Luo

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords backdoor attack · world models · trajectory ranking · model-based reinforcement learning · planning agents · adversarial robustness · DreamerV3 · TD-MPC

The pith

World models can be hijacked by reordering the ranking of a few critical imagined trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that world models plan by internally generating imagined trajectories and selecting among them based on their ranked value. This process creates a long-tailed distribution in which only a small number of trajectories are decision-critical, leaving the overall ranking open to targeted disruption. The authors present TRAP as a backdoor method that uses a specialized loss to focus on these tail trajectories and gating mechanisms to keep clean behavior intact. If correct, the claim means that standard defenses against input or policy attacks will not protect planning agents, because the vulnerability sits inside the ranking step that converts imagination into action.

Core claim

World models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. TRAP exploits this by combining a tail-aware ranking loss with dual gating mechanisms that stabilize optimization and regulate when the attack penalty is applied, resulting in redirected planning under trigger conditions while largely preserving normal ranking on clean inputs.

What carries the argument

The tail-aware ranking loss that focuses optimization on decision-critical trajectories, together with dual gating mechanisms that stabilize training and control application of the attack penalty.
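
For concreteness, a tail-focused ranking objective with a simple gate could look like the Python sketch below. This is an editorial illustration assembled only from the abstract's description: the function signature, the top-k definition of the decision-critical tail, the hinge-style promotion term, the frozen reference model, and the separation-based gate are assumptions, not the paper's actual loss.

    import torch
    import torch.nn.functional as F

    def tail_aware_ranking_loss(clean_values, clean_values_ref, trig_values,
                                target_idx, k=8, margin=1.0,
                                gate_threshold=0.1, clean_weight=1.0):
        """Illustrative sketch of a tail-focused ranking objective (assumed form).

        clean_values:     (N,) values the backdoored model assigns to N imagined
                          trajectories on a clean input.
        clean_values_ref: (N,) values a frozen reference model assigns to the same
                          trajectories (anchors clean behavior; an assumption here).
        trig_values:      (N,) values the backdoored model assigns under the trigger.
        target_idx:       index of the attacker-chosen trajectory to promote.
        """
        # Decision-critical tail: the top-k trajectories under the reference
        # ranking, i.e. the few candidates the planner would actually act on.
        tail_vals, tail_idx = torch.topk(clean_values_ref, k)

        # Gate: apply the attack penalty only when the tail is clearly separated
        # from the bulk, so that reordering it would actually change the plan and
        # the gradient is not dominated by noisy low-value trajectories.
        gate = (tail_vals.mean() - clean_values_ref.mean()) > gate_threshold

        # Attack term: under the trigger, the target trajectory should outrank
        # every tail trajectory by at least `margin` (pairwise hinge).
        attack = F.relu(margin + trig_values[tail_idx] - trig_values[target_idx]).mean()

        # Clean-preservation term: on clean inputs the model keeps assigning
        # roughly the reference values, so the normal ranking stays intact.
        preserve = F.mse_loss(clean_values, clean_values_ref)

        return gate.float() * attack + clean_weight * preserve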

If this is right

  • Under trigger conditions, planning outcomes are redirected toward attacker-chosen behaviors.
  • On clean inputs, the original ranking structure and task performance remain largely unchanged.
  • The attack produces sustained behavioral deviations and measurable performance drops on DreamerV3 and TD-MPC2 across multiple tasks.
  • Existing backdoor techniques aimed at features or one-step predictions are insufficient for world-model planners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses will need to regularize or monitor the internal ranking distribution rather than only the input or final policy (a minimal monitoring sketch follows this list).
  • Any planning system that selects actions by ranking many generated candidates may inherit similar tail-based vulnerabilities.
  • Training procedures that deliberately flatten the trajectory-value distribution could reduce attack surface at the cost of planning efficiency.
  • The same ranking-attack pattern could be tested in non-RL generative planners that rely on internal simulation and selection.
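
The first point above can be made concrete with a small monitor, sketched under the assumption that a defender can read out the planner's per-trajectory values: compare a trusted reference ranking against the current one and flag episodes where the decision-critical tail is reshuffled while the bulk ranking still looks normal. The overlap metric, the rank-correlation formula without tie handling, and the thresholds in the usage note are illustrative, not a validated defense.

    import torch

    def ranking_shift_score(ref_values, cur_values, k=8):
        """Hypothetical monitor for the internal trajectory-ranking distribution."""
        n = ref_values.numel()

        # Fraction of the decision-critical tail (top-k under the reference
        # ranking) that survives re-ranking by the current model.
        ref_top = set(torch.topk(ref_values, k).indices.tolist())
        cur_top = set(torch.topk(cur_values, k).indices.tolist())
        tail_overlap = len(ref_top & cur_top) / k

        # Spearman-style rank correlation over all trajectories (no tie handling).
        ref_rank = torch.argsort(torch.argsort(ref_values)).float()
        cur_rank = torch.argsort(torch.argsort(cur_values)).float()
        d = ref_rank - cur_rank
        spearman = 1.0 - 6.0 * (d * d).sum() / (n * (n * n - 1))

        return tail_overlap, spearman.item()

    # Usage (illustrative thresholds): a low tail overlap combined with a high
    # overall correlation is the signature a ranking attack like TRAP would leave.
    # overlap, rho = ranking_shift_score(ref_vals, cur_vals, k=8)
    # suspicious = overlap < 0.5 and rho > 0.8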

Load-bearing premise

The long-tailed ranking structure of imagined trajectories stays stable enough on clean data to preserve performance yet remains fragile enough that a trigger can reorder the critical tail without being absorbed by the learned dynamics.
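
One way to probe this premise, assuming an MPPI/CEM-style planner that weights imagined trajectories by a softmax over their predicted returns (an assumption about the planner, not a quantity reported in the paper), is to measure how much planning weight the top few trajectories carry on clean versus triggered inputs; a stable, heavy tail on clean rollouts and a redistributed one under the trigger would match the premise.

    import torch

    def tail_concentration(values, k=8, temperature=1.0):
        """Rough diagnostic for how long-tailed the trajectory-value ranking is."""
        # Softmax weighting stands in for an MPPI/CEM-style planner; the share of
        # weight held by the top-k trajectories approaches 1 when a few
        # trajectories dominate the decision, i.e. when there is a small,
        # attackable tail.
        weights = torch.softmax(values / temperature, dim=0)
        return torch.topk(weights, k).values.sum().item()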

What would settle it

A test in which the trigger is applied yet the world model still selects and executes the original highest-ranked trajectories, showing that the dynamics prior overrode the ranking change.

Figures

Figures reproduced from arXiv: 2605.01950 by Ke Zhang, Siyuan Duan, Xizhao Luo.

Figure 1. Motivation of TRAP. (a) Under clean conditions, …
Figure 2. Overview of TRAP. The trigger selectively suppresses decision-critical tail trajectories, shifts trajectory ranking, and …
Figure 3. Ablation and mechanistic analysis of TRAP. (a) Component ablation under …
Figure 4. Diagnostic experiments and defense analysis of TRAP. (a) Effect of patch size ratio on Attack Success Rate (ASR) and …
Original abstract

World models enable long-horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination-driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one-step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. To exploit this vulnerability, we propose TRAP, a backdoor attack framework for world models that targets imagined trajectory ranking. TRAP combines a tail-aware ranking loss to focus optimization on decision-critical trajectories with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Under trigger conditions, TRAP alters the relative ranking of imagined trajectories to redirect planning outcomes, while largely maintaining the normal ranking structure on clean inputs. Experiments on DreamerV3 and TD-MPC2 across diverse tasks show that TRAP consistently induces sustained behavioral deviations and significant performance degradation, highlighting the need for dedicated security evaluation of world-model-based agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that world models for planning exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where reordering a few decision-critical trajectories can hijack planning outcomes. It proposes TRAP, which combines a tail-aware ranking loss focused on critical trajectories with dual gating mechanisms to stabilize optimization and control attack application. Under trigger conditions, the attack redirects planning while largely preserving clean ranking and performance; experiments on DreamerV3 and TD-MPC2 across diverse tasks demonstrate sustained behavioral deviations with limited clean degradation.

Significance. If the results hold, the work is significant for identifying a planning-specific vulnerability in world models that differs from local-feature or one-step attacks on reactive models. The concrete loss formulation, gating logic, and empirical validation on two state-of-the-art algorithms provide reproducible evidence of the vulnerability and underscore the need for dedicated security evaluation of imagination-driven agents.

minor comments (3)
  1. [Abstract] The phrase 'diverse tasks' is used without enumeration; specifying the task suite (e.g., by name or category) would improve immediate readability.
  2. [§3, Method] The dual-gating logic is described in prose; a compact pseudocode block or flowchart would clarify the timing and scope of the attack penalty.
  3. [§4, Experiments] While results are reported, moving key baseline comparisons and trigger-design details from the appendix into the main text would strengthen the central empirical narrative.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on TRAP and for recommending minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical attack proposal with no reductive derivations

full rationale

The manuscript frames TRAP as an empirical backdoor attack on world-model planners, supported by concrete loss formulations, gating mechanisms, and experimental results on DreamerV3 and TD-MPC2. No equations, uniqueness theorems, or derivation chains are present that reduce the claimed vulnerability or attack success to fitted parameters, self-citations, or ansatzes by construction. The long-tailed ranking observation is presented as an empirical finding rather than a self-defined premise, and clean-performance preservation is validated externally via reported metrics rather than forced by the method definition. This satisfies the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the empirical observation of long-tailed trajectory ranking, treated as a domain property rather than a derived quantity.

pith-pipeline@v0.9.0 · 5536 in / 1112 out tokens · 54084 ms · 2026-05-10T15:11:08.333431+00:00 · methodology

