Stealthy World Model Manipulation via Data Poisoning

Xiaolin Sun; Yibin Hu; Zizhan Zheng

arxiv: 2606.18697 · v2 · pith:4O2UA7QRnew · submitted 2026-06-17 · 💻 cs.LG · cs.CR· cs.RO

Stealthy World Model Manipulation via Data Poisoning

Yibin Hu , Xiaolin Sun , Zizhan Zheng This is my paper

Pith reviewed 2026-06-26 21:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CRcs.RO

keywords data poisoningworld modelsmodel-based reinforcement learningadversarial attackscontinuous controlstealthy attacksbilevel optimizationgradient matching

0 comments

The pith

A two-stage data poisoning attack can manipulate learned world models to induce low-return planning while keeping changes close to clean data and evading non-adaptive defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SWAAP as a two-stage framework to poison fine-tuning trajectories for model-based agents. The first stage uses bilevel optimization to locate a harmful target world model that stays near clean dynamics yet produces poor planning returns. The second stage applies gradient matching under stealth constraints and a prediction-error regularizer to steer training toward that target with only limited modifications. A sympathetic reader would care because world-model updates from experience are central to adaptation, so successful poisoning would corrupt downstream behavior without clear data anomalies. The results indicate this occurs across continuous-control tasks while bypassing the tested defenses.

Core claim

SWAAP identifies a harmful target world model via first-order bilevel optimization enabled by a transition-gradient theorem, where the target induces low-return behavior under planning yet remains close to clean dynamics; it then realizes this target by modifying a limited fraction of fine-tuning transition targets through stealth-constrained gradient matching and a prediction-error regularizer so that the induced gradients steer the victim model toward the target while poisoned data stays close to the model's natural approximation error.

What carries the argument

SWAAP, the two-stage data poisoning framework that locates a harmful target world model through bilevel optimization and realizes it via gradient matching on limited transitions.

If this is right

Substantial performance degradation occurs across diverse continuous-control tasks.
Poisoned transitions remain close to clean data.
Evaluated non-adaptive residual, CUSUM, and TRIM-style defenses are evaded at pre-training, fine-tuning, and test-time stages.
This reveals a practical vulnerability in world-model adaptation pipelines.
Robustness methods are needed to protect both world-model training data and learned dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses would need to consider bilevel optimization when screening for harmful targets rather than only statistical residuals.
The attack surface may extend to other online adaptation loops where agents collect their own fine-tuning data.
Test-time monitoring of dynamics might require additional checks beyond prediction-error thresholds to catch induced low-return behavior.
Practitioners adapting models in deployed systems may need controlled data collection or verification steps to limit poisoning risk.

Load-bearing premise

The transition-gradient theorem enables first-order bilevel optimization to identify a harmful target world model close to clean dynamics yet inducing low-return behavior under planning.

What would settle it

An experiment in which the bilevel optimization yields no target model that is simultaneously close to clean dynamics and low-return under planning, or in which gradient matching on a limited fraction of transitions fails to steer the victim model without larger deviations or detection.

Figures

Figures reproduced from arXiv: 2606.18697 by Xiaolin Sun, Yibin Hu, Zizhan Zheng.

**Figure 1.** Figure 1: Pipeline of SWAAP. In our main experiments, we assume white-box access to the victim world-model architecture, pretrained weights ψ0, and fine-tuning loss. We use this as an attack feasibility study and a worstcase robustness evaluation, analogous to whitebox adversarial evaluation in supervised learning; it is also relevant to open-weight world models and downstream adaptation settings. The attacker may… view at source ↗

**Figure 2.** Figure 2: SWAAP versus SWAAP (Random) across stealth weights α with rp = 0.1. SWAAP generally causes larger return drops under smaller stealth weights. 0.1 0.2 0.3 0.4 0.5 m 0 200 400 600 800 1000 Return Humanoid-Walk SWAAP Direct Model Poisoning SWAAP (Random) Clean Return Clean m, ref (a) Humanoid-Walk frontier 0.0 0.1 0.2 0.3 0.4 0.5 m 0 1000 2000 3000 4000 5000 Return Myo-Pen-Twirl-Hard SWAAP Direct Model Poison… view at source ↗

**Figure 3.** Figure 3: (a) and (b) Attack-stealth frontier: DMP evaluates saved Stage 1 target models directly, while SWAAP applies Stage 2 data poisoning toward selected targets. (c) Robust-training defense: on Humanoid-Walk, SWAAP (rp, α = 0.1, 0.9) remains effective under TRIM robust training (β = 0.2). Second, Appendices G.1 and G.2 test Stage 2 design choices and alternative baselines, respectively. Replacing the top-rp tra… view at source ↗

**Figure 4.** Figure 4: SWAAP and SWAAP (Random) across poisoning ratios [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Model poisoning curves (referred to as Direct Model Poisoning in Table 1) of return and [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: Trajectory-level ROC curves for the residual-based CUSUM detector on MW-Push. Each [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗

**Figure 7.** Figure 7: Pre-training data-level deviation across stealth weights [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗

**Figure 8.** Figure 8: Latent-space PCA visualization on Humanoid-Walk. (a) compares poisoned next-state [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗

read the original abstract

Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SWAAP introduces a two-stage poisoning attack on world models via bilevel optimization and gradient matching, but the load-bearing transition-gradient theorem lacks any derivation or verification in the provided material.

read the letter

The paper's main point is a new attack pipeline called SWAAP that first solves a bilevel problem to pick a target world model close to the clean one yet bad for planning, then poisons a limited set of transition targets so the victim's training gradients move toward that target while a regularizer keeps the poisons looking natural.

What is actually new is the specific two-stage split: bilevel optimization for the target plus stealth-constrained gradient matching on the transition outputs. Prior poisoning work on RL has used direct data injection or reward tampering, but this combination aimed at world-model fine-tuning trajectories does not appear in the cited literature. The framing of the attack surface in model-based adaptation pipelines is also clear and points to a real deployment concern for robotics and control agents.

The soft spot is exactly the one the stress-test flags. The first stage only works if the transition-gradient theorem lets first-order bilevel optimization find a target that is simultaneously close to clean dynamics and harmful under planning. The abstract asserts the theorem enables this but supplies no derivation, assumptions, or empirical check that the optimized targets satisfy both conditions at once. Without that, the second stage has nothing concrete to aim for. The claimed results on performance drop and defense evasion are described only at a high level with no numbers or error bars, so the practical strength of the attack remains unshown.

This paper is for people working on RL robustness and poisoning defenses. A reader already thinking about adaptation pipelines would get value from the attack surface description and the defense evaluation setup. It deserves a serious referee because the vulnerability it identifies is concrete and within the subfield's scope, even though the current draft needs the theorem and the experimental details filled in before the claims can be evaluated.

Referee Report

1 major / 2 minor

Summary. The paper proposes SWAAP, a two-stage data poisoning attack on learned world models for model-based RL agents. Stage 1 identifies a harmful target world model (close to clean dynamics yet inducing low-return planning) via first-order bilevel optimization enabled by a transition-gradient theorem. Stage 2 realizes the target by poisoning a limited fraction of fine-tuning transitions through stealth-constrained gradient matching plus a prediction-error regularizer. Across continuous-control tasks, the attack is claimed to cause substantial performance degradation while evading non-adaptive residual/CUSUM/TRIM defenses and keeping poisoned transitions close to clean data.

Significance. If the transition-gradient theorem holds and the experimental results are reproducible, the work identifies a practical attack surface in world-model fine-tuning pipelines. The bilevel optimization plus gradient-matching construction is a novel contribution to poisoning attacks on dynamics models, and the multi-stage stealth evaluation (pre-training detection, robust training, test-time monitoring) strengthens the claim of practicality. This could motivate new defenses focused on both data and learned dynamics.

major comments (1)

[Transition-gradient theorem (abstract and derivation section)] The transition-gradient theorem is load-bearing for Stage 1: the abstract states that it enables first-order bilevel optimization to produce a target model that is simultaneously close to clean dynamics and harmful under planning. No derivation, assumptions, or empirical verification that the resulting targets satisfy both conditions is provided, so the two-stage attack cannot be realized if the theorem does not hold in the claimed form.

minor comments (2)

[Abstract] The abstract asserts 'substantial performance degradation' and 'evading the evaluated defenses' but reports no quantitative metrics, poisoning fractions, error bars, or specific task results; these numbers belong in the results section or abstract.
Notation for the bilevel objective, the gradient-matching loss, and the prediction-error regularizer should be introduced with explicit equations early in the method section for clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed review and for highlighting the importance of the transition-gradient theorem. We agree that its derivation, assumptions, and verification are essential for the two-stage attack and will strengthen the manuscript by including them.

read point-by-point responses

Referee: [Transition-gradient theorem (abstract and derivation section)] The transition-gradient theorem is load-bearing for Stage 1: the abstract states that it enables first-order bilevel optimization to produce a target model that is simultaneously close to clean dynamics and harmful under planning. No derivation, assumptions, or empirical verification that the resulting targets satisfy both conditions is provided, so the two-stage attack cannot be realized if the theorem does not hold in the claimed form.

Authors: We agree that the transition-gradient theorem requires explicit support. The current manuscript states the theorem but omits the full derivation for space reasons. In the revision we will add: (i) the complete proof with all assumptions (differentiability of the planning objective w.r.t. dynamics parameters, Lipschitz continuity of the transition function, and bounded planning horizon), (ii) a dedicated subsection deriving the first-order bilevel gradient, and (iii) empirical verification on the evaluated environments showing that the bilevel-optimized targets remain within a small L2 distance of the clean model while producing substantially lower returns under the planner. These additions will confirm that the Stage-1 targets satisfy both closeness and harmfulness conditions, making the two-stage attack realizable. revision: yes

Circularity Check

0 steps flagged

No circularity: attack defined via procedural optimization, not by construction or self-citation reduction

full rationale

The paper defines SWAAP as a two-stage procedural attack: stage 1 uses first-order bilevel optimization (enabled by a transition-gradient theorem) to identify a target model, and stage 2 uses gradient matching with a regularizer. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or reduces the central result to its own inputs by definition. The transition-gradient theorem is invoked as an enabler for the optimization procedure rather than a self-referential result; empirical evaluations on continuous-control tasks provide independent falsifiability. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The attack rests on the existence of a transition-gradient theorem that allows first-order bilevel optimization and on the assumption that a limited poisoning budget suffices to steer the model; no new physical entities are introduced.

free parameters (2)

poisoning fraction
The limited fraction of fine-tuning transitions that are modified; chosen to balance attack success and stealth.
prediction-error regularizer weight
Controls how close poisoned targets must stay to the world model's natural approximation error.

axioms (1)

domain assumption A transition-gradient theorem exists that enables first-order bilevel optimization over world-model parameters.
Invoked to make the target-model identification tractable in stage one.

pith-pipeline@v0.9.1-grok · 5797 in / 1363 out tokens · 21653 ms · 2026-06-26T21:31:38.267575+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Bruce, M

URL https://arxiv.org/ abs/2402.15391. Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. MyoSuite: A contact-rich simulation suite for musculoskeletal motor control. InAnnual Learning for Dynamics and Control Conference,

work page arXiv
[2]

Saeed Ghadimi and Mengdi Wang

URL https://arxiv.org/abs/2009.02276. Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming.arXiv preprint arXiv:1802.02246,

work page arXiv 2009
[3]

Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

URL https: //arxiv.org/abs/2302.04659. David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122,

work page arXiv
[4]

Mastering Diverse Domains through World Models

URLhttps://arxiv.org/abs/2301.04104. Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR),

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions.arXiv preprint arXiv:1903.03088,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[6]

Cosmos World Foundation Model Platform for Physical AI

Nvidia. Cosmos world foundation model platform for Physical AI. https://arxiv.org/abs/ 2501.03575, 2025a. Nvidia. World foundation models. https://www.nvidia.com/en-us/glossary/ world-models/, 2025b. OpenAI. Creating video from text.https://openai.com/index/sora/,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite.arXiv preprint arXiv:1801.00690,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

URLhttps://arxiv.org/abs/2411.04983. 13 Appendix Table of Contents A Use of LLMs 16 B Broader Impact 16 C Related Works 16 C.1 World Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 C.2 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 17 C.3 Model Poisoning Attack . . . . . . . . . . . ....

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Therefore, ∇ψJ(P ψ, π) = X s dπ(s) X a π(a|s) X s′ ∇ψPψ(s′|s, a)(R(s, a, s′) +γV(s ′)) =E (s,a,r,s′)∼Pψ,π[(R(s, a, s′) +γV(s ′))∇ψ logP ψ(s′|s, a)]

= P s′ ρπ(s→s ′, k)ρπ(s′ →x,1) is used to write the transition gradient into recursive form. Therefore, ∇ψJ(P ψ, π) = X s dπ(s) X a π(a|s) X s′ ∇ψPψ(s′|s, a)(R(s, a, s′) +γV(s ′)) =E (s,a,r,s′)∼Pψ,π[(R(s, a, s′) +γV(s ′))∇ψ logP ψ(s′|s, a)]. For planning with a finite horizon T , given a time-indexed value function Vt(st) with terminal value VT (sT ) = 0,...

2013
[10]

Other parameters are the same as stated in TD-MPC2 [Hansen et al., 2024]. 22 Table 2: Key hyperparameters used in experiments; other parameters are the same as in TD-MPC2 Hyperparameter Typical value(s) Description Model Poisoning Hyperparameters obs_dim 17-223 The range of observation dimensions for tasks action_dim 6-39 The range of action dimensions fo...

2024
[11]

perturbing a fraction of state-action pairs

Stage 1 optimizes the parameters of a global neural world model, so changing ˆψ can in principle affect predictions beyond any selected subset of inputs. In this ablation, 30 Table 7: Attack performance with different data poison ratio (rp) in Stage 2 in Humanoid-Walk with α= 0.9 rp Return δm clean 866 ± 57 0.09 ± 0.05 0.01 844 ± 94 0.10 ± 0.05 0.05 739 ±...

2022
[12]

For Obj-Hold-Hard, Key-Turn, and Coffee-Pull, we set α= 0.9 , while for Pose and Door-Open, we use α= 0.1

All experiments use rp = 0.1 and are evaluated over 100 episodes. For Obj-Hold-Hard, Key-Turn, and Coffee-Pull, we set α= 0.9 , while for Pose and Door-Open, we use α= 0.1 . These results further confirm that SWAAP reliably reduces the return while keeping the poisoned world model close to the clean model, consistent with our main findings. H.2 DINO-WM Pu...

2025

[1] [1]

Bruce, M

URL https://arxiv.org/ abs/2402.15391. Vittorio Caggiano, Huawei Wang, Guillaume Durandau, Massimo Sartori, and Vikash Kumar. MyoSuite: A contact-rich simulation suite for musculoskeletal motor control. InAnnual Learning for Dynamics and Control Conference,

work page arXiv

[2] [2]

Saeed Ghadimi and Mengdi Wang

URL https://arxiv.org/abs/2009.02276. Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming.arXiv preprint arXiv:1802.02246,

work page arXiv 2009

[3] [3]

Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

URL https: //arxiv.org/abs/2302.04659. David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122,

work page arXiv

[4] [4]

Mastering Diverse Domains through World Models

URLhttps://arxiv.org/abs/2301.04104. Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR),

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions.arXiv preprint arXiv:1903.03088,

work page internal anchor Pith review Pith/arXiv arXiv 1903

[6] [6]

Cosmos World Foundation Model Platform for Physical AI

Nvidia. Cosmos world foundation model platform for Physical AI. https://arxiv.org/abs/ 2501.03575, 2025a. Nvidia. World foundation models. https://www.nvidia.com/en-us/glossary/ world-models/, 2025b. OpenAI. Creating video from text.https://openai.com/index/sora/,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite.arXiv preprint arXiv:1801.00690,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

URLhttps://arxiv.org/abs/2411.04983. 13 Appendix Table of Contents A Use of LLMs 16 B Broader Impact 16 C Related Works 16 C.1 World Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 C.2 Model-Based Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 17 C.3 Model Poisoning Attack . . . . . . . . . . . ....

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Therefore, ∇ψJ(P ψ, π) = X s dπ(s) X a π(a|s) X s′ ∇ψPψ(s′|s, a)(R(s, a, s′) +γV(s ′)) =E (s,a,r,s′)∼Pψ,π[(R(s, a, s′) +γV(s ′))∇ψ logP ψ(s′|s, a)]

= P s′ ρπ(s→s ′, k)ρπ(s′ →x,1) is used to write the transition gradient into recursive form. Therefore, ∇ψJ(P ψ, π) = X s dπ(s) X a π(a|s) X s′ ∇ψPψ(s′|s, a)(R(s, a, s′) +γV(s ′)) =E (s,a,r,s′)∼Pψ,π[(R(s, a, s′) +γV(s ′))∇ψ logP ψ(s′|s, a)]. For planning with a finite horizon T , given a time-indexed value function Vt(st) with terminal value VT (sT ) = 0,...

2013

[10] [10]

Other parameters are the same as stated in TD-MPC2 [Hansen et al., 2024]. 22 Table 2: Key hyperparameters used in experiments; other parameters are the same as in TD-MPC2 Hyperparameter Typical value(s) Description Model Poisoning Hyperparameters obs_dim 17-223 The range of observation dimensions for tasks action_dim 6-39 The range of action dimensions fo...

2024

[11] [11]

perturbing a fraction of state-action pairs

Stage 1 optimizes the parameters of a global neural world model, so changing ˆψ can in principle affect predictions beyond any selected subset of inputs. In this ablation, 30 Table 7: Attack performance with different data poison ratio (rp) in Stage 2 in Humanoid-Walk with α= 0.9 rp Return δm clean 866 ± 57 0.09 ± 0.05 0.01 844 ± 94 0.10 ± 0.05 0.05 739 ±...

2022

[12] [12]

For Obj-Hold-Hard, Key-Turn, and Coffee-Pull, we set α= 0.9 , while for Pose and Door-Open, we use α= 0.1

All experiments use rp = 0.1 and are evaluated over 100 episodes. For Obj-Hold-Hard, Key-Turn, and Coffee-Pull, we set α= 0.9 , while for Pose and Door-Open, we use α= 0.1 . These results further confirm that SWAAP reliably reduces the return while keeping the poisoned world model close to the clean model, consistent with our main findings. H.2 DINO-WM Pu...

2025