pith. machine review for the scientific record.

arxiv: 2605.01950 · v1 · submitted 2026-05-03 · 💻 cs.LG · cs.AI

Recognition: unknown

TRAP: Tail-aware Ranking Attack for World-Model Planning

Ke Zhang, Siyuan Duan, Xizhao Luo

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords backdoor attack · world models · trajectory ranking · model-based reinforcement learning · planning agents · adversarial robustness · DreamerV3 · TD-MPC

The pith

World models can be hijacked by reordering the ranking of a few critical imagined trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that world models plan by internally generating imagined trajectories and selecting among them based on their ranked value. This process creates a long-tailed distribution in which only a small number of trajectories are decision-critical, leaving the overall ranking open to targeted disruption. The authors present TRAP as a backdoor method that uses a specialized loss to focus on these tail trajectories and gating mechanisms to keep clean behavior intact. If correct, the claim means that standard defenses against input or policy attacks will not protect planning agents, because the vulnerability sits inside the ranking step that converts imagination into action.

Core claim

World models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. TRAP exploits this by combining a tail-aware ranking loss with dual gating mechanisms that stabilize optimization and regulate when the attack penalty is applied, resulting in redirected planning under trigger conditions while largely preserving normal ranking on clean inputs.

What carries the argument

The tail-aware ranking loss that focuses optimization on decision-critical trajectories, together with dual gating mechanisms that stabilize training and control application of the attack penalty.
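
For concreteness, a tail-focused ranking objective with a simple gate could look like the Python sketch below. This is an editorial illustration assembled only from the abstract's description: the function signature, the top-k definition of the decision-critical tail, the hinge-style promotion term, the frozen reference model, and the separation-based gate are assumptions, not the paper's actual loss.

    import torch
    import torch.nn.functional as F

    def tail_aware_ranking_loss(clean_values, clean_values_ref, trig_values,
                                target_idx, k=8, margin=1.0,
                                gate_threshold=0.1, clean_weight=1.0):
        """Illustrative sketch of a tail-focused ranking objective (assumed form).

        clean_values:     (N,) values the backdoored model assigns to N imagined
                          trajectories on a clean input.
        clean_values_ref: (N,) values a frozen reference model assigns to the same
                          trajectories (anchors clean behavior; an assumption here).
        trig_values:      (N,) values the backdoored model assigns under the trigger.
        target_idx:       index of the attacker-chosen trajectory to promote.
        """
        # Decision-critical tail: the top-k trajectories under the reference
        # ranking, i.e. the few candidates the planner would actually act on.
        tail_vals, tail_idx = torch.topk(clean_values_ref, k)

        # Gate: apply the attack penalty only when the tail is clearly separated
        # from the bulk, so that reordering it would actually change the plan and
        # the gradient is not dominated by noisy low-value trajectories.
        gate = (tail_vals.mean() - clean_values_ref.mean()) > gate_threshold

        # Attack term: under the trigger, the target trajectory should outrank
        # every tail trajectory by at least `margin` (pairwise hinge).
        attack = F.relu(margin + trig_values[tail_idx] - trig_values[target_idx]).mean()

        # Clean-preservation term: on clean inputs the model keeps assigning
        # roughly the reference values, so the normal ranking stays intact.
        preserve = F.mse_loss(clean_values, clean_values_ref)

        return gate.float() * attack + clean_weight * preserve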

If this is right

  • Under trigger conditions, planning outcomes are redirected toward attacker-chosen behaviors.
  • On clean inputs, the original ranking structure and task performance remain largely unchanged.
  • The attack produces sustained behavioral deviations and measurable performance drops on DreamerV3 and TD-MPC2 across multiple tasks.
  • Existing backdoor techniques aimed at features or one-step predictions are insufficient for world-model planners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses will need to regularize or monitor the internal ranking distribution rather than only the input or final policy (a minimal monitoring sketch follows this list).
  • Any planning system that selects actions by ranking many generated candidates may inherit similar tail-based vulnerabilities.
  • Training procedures that deliberately flatten the trajectory-value distribution could reduce attack surface at the cost of planning efficiency.
  • The same ranking-attack pattern could be tested in non-RL generative planners that rely on internal simulation and selection.
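
The first point above can be made concrete with a small monitor, sketched under the assumption that a defender can read out the planner's per-trajectory values: compare a trusted reference ranking against the current one and flag episodes where the decision-critical tail is reshuffled while the bulk ranking still looks normal. The overlap metric, the rank-correlation formula without tie handling, and the thresholds in the usage note are illustrative, not a validated defense.

    import torch

    def ranking_shift_score(ref_values, cur_values, k=8):
        """Hypothetical monitor for the internal trajectory-ranking distribution."""
        n = ref_values.numel()

        # Fraction of the decision-critical tail (top-k under the reference
        # ranking) that survives re-ranking by the current model.
        ref_top = set(torch.topk(ref_values, k).indices.tolist())
        cur_top = set(torch.topk(cur_values, k).indices.tolist())
        tail_overlap = len(ref_top & cur_top) / k

        # Spearman-style rank correlation over all trajectories (no tie handling).
        ref_rank = torch.argsort(torch.argsort(ref_values)).float()
        cur_rank = torch.argsort(torch.argsort(cur_values)).float()
        d = ref_rank - cur_rank
        spearman = 1.0 - 6.0 * (d * d).sum() / (n * (n * n - 1))

        return tail_overlap, spearman.item()

    # Usage (illustrative thresholds): a low tail overlap combined with a high
    # overall correlation is the signature a ranking attack like TRAP would leave.
    # overlap, rho = ranking_shift_score(ref_vals, cur_vals, k=8)
    # suspicious = overlap < 0.5 and rho > 0.8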

Load-bearing premise

The long-tailed ranking structure of imagined trajectories stays stable enough on clean data to preserve performance yet remains fragile enough that a trigger can reorder the critical tail without being absorbed by the learned dynamics.
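
One way to probe this premise, assuming an MPPI/CEM-style planner that weights imagined trajectories by a softmax over their predicted returns (an assumption about the planner, not a quantity reported in the paper), is to measure how much planning weight the top few trajectories carry on clean versus triggered inputs; a stable, heavy tail on clean rollouts and a redistributed one under the trigger would match the premise.

    import torch

    def tail_concentration(values, k=8, temperature=1.0):
        """Rough diagnostic for how long-tailed the trajectory-value ranking is."""
        # Softmax weighting stands in for an MPPI/CEM-style planner; the share of
        # weight held by the top-k trajectories approaches 1 when a few
        # trajectories dominate the decision, i.e. when there is a small,
        # attackable tail.
        weights = torch.softmax(values / temperature, dim=0)
        return torch.topk(weights, k).values.sum().item()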

What would settle it

A test in which the trigger is applied yet the world model still selects and executes the original highest-ranked trajectories, showing that the dynamics prior overrode the ranking change.

Figures

Figures reproduced from arXiv: 2605.01950 by Ke Zhang, Siyuan Duan, Xizhao Luo.

Figure 1. Motivation of TRAP. (a) Under clean conditions, …
Figure 2. Overview of TRAP. The trigger selectively suppresses decision-critical tail trajectories, shifts trajectory ranking, and …
Figure 3. Ablation and mechanistic analysis of TRAP. (a) Component ablation under …
Figure 4. Diagnostic experiments and defense analysis of TRAP. (a) Effect of patch size ratio on Attack Success Rate (ASR) and …
Original abstract

World models enable long-horizon planning by internally generating and evaluating imagined trajectories, making them a promising foundation for generalist agents. However, this imagination-driven decision process also introduces new security risks. Existing backdoor attacks typically aim to manipulate local features, one-step predictions, or instantaneous policy outputs. While such objectives may suffice for weaker reactive models, they are often ineffective against world models, where the learned dynamics prior and planning process can absorb or wash out the effects of shallow perturbations. More importantly, we find that world models exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where disrupting the ordering of a few decision-critical trajectories can systematically hijack planning. To exploit this vulnerability, we propose TRAP, a backdoor attack framework for world models that targets imagined trajectory ranking. TRAP combines a tail-aware ranking loss to focus optimization on decision-critical trajectories with dual gating mechanisms that stabilize optimization and regulate when and where the attack penalty is applied. Under trigger conditions, TRAP alters the relative ranking of imagined trajectories to redirect planning outcomes, while largely maintaining the normal ranking structure on clean inputs. Experiments on DreamerV3 and TD-MPC2 across diverse tasks show that TRAP consistently induces sustained behavioral deviations and significant performance degradation, highlighting the need for dedicated security evaluation of world-model-based agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that world models for planning exhibit a distinct backdoor vulnerability rooted in the long-tailed ranking structure of imagined trajectories, where reordering a few decision-critical trajectories can hijack planning outcomes. It proposes TRAP, which combines a tail-aware ranking loss focused on critical trajectories with dual gating mechanisms to stabilize optimization and control attack application. Under trigger conditions, the attack redirects planning while largely preserving clean ranking and performance; experiments on DreamerV3 and TD-MPC2 across diverse tasks demonstrate sustained behavioral deviations with limited clean degradation.

Significance. If the results hold, the work is significant for identifying a planning-specific vulnerability in world models that differs from local-feature or one-step attacks on reactive models. The concrete loss formulation, gating logic, and empirical validation on two state-of-the-art algorithms provide reproducible evidence of the vulnerability and underscore the need for dedicated security evaluation of imagination-driven agents.

minor comments (3)
  1. [Abstract] The phrase 'diverse tasks' is used without enumeration; specifying the task suite (e.g., by name or category) would improve immediate readability.
  2. [§3, Method] The dual-gating logic is described in prose; a compact pseudocode block or flowchart would clarify the timing and scope of the attack penalty.
  3. [§4, Experiments] While results are reported, moving key baseline comparisons and trigger-design details from the appendix into the main text would strengthen the central empirical narrative.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work on TRAP and for recommending minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical attack proposal with no reductive derivations

full rationale

The manuscript frames TRAP as an empirical backdoor attack on world-model planners, supported by concrete loss formulations, gating mechanisms, and experimental results on DreamerV3 and TD-MPC2. No equations, uniqueness theorems, or derivation chains are present that reduce the claimed vulnerability or attack success to fitted parameters, self-citations, or ansatzes by construction. The long-tailed ranking observation is presented as an empirical finding rather than a self-defined premise, and clean-performance preservation is validated externally via reported metrics rather than forced by the method definition. This satisfies the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the empirical observation of long-tailed trajectory ranking, treated as a domain property rather than a derived quantity.

pith-pipeline@v0.9.0 · 5536 in / 1112 out tokens · 54084 ms · 2026-05-10T15:11:08.333431+00:00 · methodology

