pith. machine review for the scientific record.

arxiv: 2605.06298 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords world models · implicit neural representations · video prediction · disentanglement · latent dynamics · coordinate-based networks · action matching · decoder-free rendering

The pith

A world model stores each video state as the weights of a small implicit neural network and renders frames analytically from those weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NOVA, which encodes video frames not as opaque latent vectors but directly as the parameters of an auxiliary coordinate-based implicit neural representation. Because the representation can be rendered by querying the network at any coordinate, the model avoids a separate decoder entirely. This design yields compact, portable states that support zero-shot super-resolution and controllable forecasting after distillation into a video generator. Without auxiliary losses or adversarial training, the same weight-space states spontaneously separate background, foreground, and motion, allowing independent editing of content and dynamics. The framework runs at roughly 40 million parameters on a single consumer GPU across several video datasets.

Core claim

NOVA represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other.

What carries the argument

The INR weight-and-bias vector that serves as the world-model state; it is rendered by direct coordinate-wise evaluation of the implicit network rather than decoded by a separate network.
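
To make this concrete, here is a minimal sketch, assuming a toy SIREN-style network (the layer sizes, activation frequency, and all names below are illustrative, not the paper's architecture). The flattened weight vector plays the role of the state; rendering is nothing more than evaluating the network on a coordinate grid, and because the grid is chosen at query time, the same state renders at any resolution.

import numpy as np

def render(weights, H, W, hidden=32):
    """Analytically render an H x W grayscale frame from a flat weight vector."""
    # Unpack a tiny 2 -> hidden -> 1 MLP from the state vector.
    i = 0
    W1 = weights[i:i + 2 * hidden].reshape(2, hidden); i += 2 * hidden
    b1 = weights[i:i + hidden];                        i += hidden
    W2 = weights[i:i + hidden].reshape(hidden, 1);     i += hidden
    b2 = weights[i:i + 1]

    # Coordinate grid in [-1, 1]^2; resolution is a query-time choice.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2)

    h = np.sin(30.0 * (coords @ W1 + b1))  # SIREN-style periodic activation
    return (h @ W2 + b2).reshape(H, W)

dz = 2 * 32 + 32 + 32 + 1                             # 129 scalars: the whole "state"
z_t = np.random.default_rng(0).normal(size=dz) * 0.1  # stand-in for an encoder output

frame = render(z_t, 64, 64)        # native resolution
frame_hr = render(z_t, 256, 256)   # same state, 4x zero-shot super-resolution

The state size is fixed by the network, not the image (cf. dz = 961 for WeatherBench in Figure 10), which is what makes this resolution-free rendering possible.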

If this is right

  • World models become smaller and portable because the state is a modest set of network weights instead of a full decoder.
  • Zero-shot super-resolution follows directly from rendering the same weights at higher coordinate resolution.
  • Independent editing of content versus dynamics becomes possible once background, foreground, and motion are disentangled in weight space.
  • The same weight states can be distilled into a context-dependent video generator using only an action-matching loss (a minimal sketch follows this list).
  • Training and inference fit on a single consumer GPU at approximately 40 million total parameters.
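
A minimal sketch of the training signal described in Figures 2 and 3, with linear stand-ins for the learned modules (all shapes, names, and the SGD step are our assumptions, not the paper's code): the FDM advances the state via the additive map A(zt) + B(ut), and the GCM is fit by regressing its predicted action onto the frozen IDM's pseudo-action with a plain L2 loss in latent space.

import numpy as np

rng = np.random.default_rng(0)
dz, du = 64, 8                     # state and latent-action sizes (illustrative)

A = rng.normal(size=(dz, dz)) / np.sqrt(dz)              # FDM state map
B = rng.normal(size=(dz, du)) / np.sqrt(du)              # FDM action map
W_idm = rng.normal(size=(du, 2 * dz)) / np.sqrt(2 * dz)  # frozen IDM
W_gcm = np.zeros((du, dz))                               # GCM, to be trained

def idm(z_t, z_next):
    # Inverse dynamics: pseudo-action from consecutive encoded states.
    return W_idm @ np.concatenate([z_t, z_next])

def fdm(z_t, u_t):
    # Forward dynamics: additive update z_{t+1} = A z_t + B u_t (Figure 2).
    return A @ z_t + B @ u_t

# One action-matching step (Figure 3): cheap because du, dz << H * W * C.
z_t, z_next = rng.normal(size=dz), rng.normal(size=dz)
u_pseudo = idm(z_t, z_next)        # target from the frozen IDM
u_pred = W_gcm @ z_t               # GCM prediction from context
loss = np.mean((u_pred - u_pseudo) ** 2)
W_gcm -= 0.1 * 2 * np.outer(u_pred - u_pseudo, z_t) / du  # SGD on the L2 loss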

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same weight-space encoding could be applied to 3-D scene modeling by replacing 2-D coordinate queries with volumetric ones.
  • Disentangled motion weights might serve as editable control signals for physics-based simulators or game engines.
  • Because the state size is independent of image resolution, the approach may scale to higher-resolution or longer video sequences without proportional memory growth.

Load-bearing premise

Storing video dynamics inside the weights of a coordinate-based neural representation will capture motion accurately enough for controllable forecasting and will separate scene structure from dynamics without extra supervision.

What would settle it

A decisive test: run the model on a held-out video dataset containing rapid object occlusions or camera motion; the claim fails if it produces either inaccurate future frames under action control or entangled edits when users attempt to change only the background.

Figures

Figures reproduced from arXiv: 2605.06298 by Mauro Comi, Roussel Desmond Nzoyem.

Figure 1. Video editing facilitated by object-motion disentanglement. Conditioned on the same initial frame, all models must generate future states; at t = 6, this state is abruptly replaced with an alien frame’s encoding. (a) The standard world model (WM), trained in an abstract latent space with NOVA’s decomposition strategy, manages to separate content from motion. (b) LAPO [Schmidt and Jiang, 2023] generates cle… view at source ↗
Figure 2. NOVA architecture. The shared Encoder (red) maps frames to weight-space offsets zt. During training, the IDM (green) infers latent action ut from consecutive encoded states. The FDM (blue) predicts the next weight offset via the additive mapping A(zt) + B(ut). Unlike conventional decoders, the Renderer (red) is not trained; it is an analytical function that maps a coordinate grid (X, Y) to pixel values via… view at source ↗
Figure 3. Action-matching training objective during phase 3. The GCM regresses onto the pseudo-action ut generated by the frozen IDM. This latent-space loss is highly efficient (dz ≪ H × W × C) and prevents the model from wasting capacity on irrelevant pixel-level features, similar to [Assran et al., 2023]. That said, if phase 1 is skipped, pixel-space reconstruction terms can be used instead of L2. For instance, in… view at source ↗
Figure 4. Robustness to minimal context. Top row: ground truth (GT); bottom row: L = 1 context frame. Even under this extreme constraint, NOVA produces coherent, near-perfect predictions. The background is absorbed into z̄, as its direct rendering leads to black pixels (see… view at source ↗
Figure 5. … digit identities gradually morph towards a resting state (8s and 9s), all the while motion trajectories are preserved. We emphasise that this is achieved without any training supervision on digit identity. This finding is complemented by… view at source ↗
Figure 7. Zero-shot super-resolution on WeatherBench. Ground truth at native 32 × 64 resolution next to nearest-neighbour, bilinear interpolation, and NOVA at ×4 and ×32 scaling. NOVA avoids aliasing and preserves macro-structure without hallucinating high-frequency textures. view at source ↗
Figure 8. Standard latent action model (LAM). The IDM computes the action from consecutive latent states during training. This approach relies on a spatial decoder to project the latent state back into pixel space, creating a resolution bottleneck (e.g., [Schmidt and Jiang, 2023]). Note that the encoder and the decoder can be identity mappings as well, to match the setting of [Zhang et al., 2025a]. Disentanglement i… view at source ↗
Figure 9. Two paradigms for signal encoding. Abstract encoders map input o to an opaque latent z requiring a trained decoder. Structured encoders map o to INR weights z, which are subsequently rendered analytically. view at source ↗
Figure 10. Detailed NOVA components as used on the WeatherBench dataset, corresponding to the experimental configuration (dz = 961, du = 16, and C = 1). Architecture of the GCM. We employ Transformer architectures exclusively for the WeatherBench and MiniGrid datasets, whereas we utilise a GRU for Moving MNIST and an LSTM for PhyWorld. Because the latter two datasets serve as the basis for our intervention experimen… view at source ↗
Figure 11. Background isolation on Moving MNIST. Left: one frame from the dataset, rendered with… view at source ↗
Figure 12. NOVA’s content identity is encoded in zt. After t = 10 (context phase), we interpolate zt → 0 while maintaining GCM-extracted actions ut. Digit identities gradually morph towards the base-network prototype (8s and 9s), while motion trajectories are preserved. No supervision on digit identity was provided. view at source ↗
Figure 13. NOVA’s spatial motion is encoded in ut. From t = 0, we interpolate ut → 0 using GCM-generated actions, while keeping zt free. Digits decelerate and converge towards the canvas centre, while their visual identities are fully preserved throughout. view at source ↗
Figure 14. Context-conditioned video generation on Moving MNIST. Frames 1–10 are context (ground truth); frames 11–20 are NOVA predictions. The model maintains digit identity and trajectory without artefacts. In each case, NOVA tracks the ground truth trajectory faithfully, correctly predicting both position and, implicitly, the physical properties of the balls. This suggests that the weight-space representation genera… view at source ↗
Figure 15. Comparison of long-horizon forecast ability. The models are only shown the initial two… view at source ↗
Figure 16. Long-horizon generation. Comparison of long forecast ability using several metrics. The models are only shown the initial two frames, and must predict from frame t = 3 to t = 1000. This complements… view at source ↗
Figure 17. OOD generalisation on PhyWorld. Ground truth (GT) versus NOVA predictions for ball radii and velocities outside the training distribution. view at source ↗
Figure 18. PhyWorld state intervention after 5 steps, with forecast generated using 3 context frames. Despite manually editing their identities, we observe accurate collision dynamics that respect the balls’ sizes and velocities in the injected states. The fact that the y-position of the alien balls is inherited (in addition to shape and colour) is addressed in Section E. view at source ↗
Figure 19. PhyWorld action intervention after 5 steps, with forecast generated using 3 context frames. We use the same sequences as in… view at source ↗
Figure 20. WeatherBench forecasting. Ground truth (GT) versus NOVA predictions conditioned on 12 initial frames (frames 1–12) and rolling out for 12 steps (frames 13–24). view at source ↗
Figure 21. Illustration of spatial aliasing. The out-of-band signal (red, dashed) coincides with the in-band signal (blue) at every training sample point (black dots). During training, however, the loss cannot distinguish the two; weights are optimised against noise, producing artefacts when the INR is queried at a finer out-of-band grid. To suppress these, we remark that entries j = 2k and j = 2k + 1 in γ(y) corres… view at source ↗
Figure 22. Zero-shot super-resolution on WeatherBench without frequency masking. Compared to… view at source ↗
Figure 23. Latent action dimensions on MiniGrid. Discrete (left): raw (grey) and quantised (blue). Continuous (right): the first two dimensions collectively encode something akin to the orientation of the agent, while the other two (not shown here) encode its location. We also implement the continuous variant as defined in… view at source ↗
Figure 24. Goal-directed navigation within a discrete action space. The GCM, conditioned only on the initial frame, imitates the BFS policy to navigate towards the goal (green box). Ground truth (top row) vs. NOVA prediction (bottom row) per sequence. view at source ↗
Figure 25. Goal-directed navigation within a continuous action space. Similar to… view at source ↗
Figure 26. WARP forecasting results conditioned on t = 10 context frames. While the model tracks general temporal dynamics, it fails to maintain sharp visual fidelity, yielding blurred and ghosted predictions. Latent disentanglement analysis: to probe the underlying topology of WARP’s learned representations, we performed the latent state intervention experiment from… view at source ↗
Figure 27. Latent State Intervention in WARP. (a) The original context frame alongside the isolated… view at source ↗
Figure 28. Phase 2 training loss with and without z̄ on Moving MNIST. view at source ↗
Figure 29. Zero-shot content retargeting on PhyWorld. Intervention starts at… view at source ↗
Figure 30. Additive vs. monolithic (joint) forward dynamics on PhyWorld. Although detrimental for… view at source ↗
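
The aliasing mechanism in Figure 21 reduces to a standard identity (our formulation, assuming unit sample spacing): an out-of-band frequency coincides with an in-band one at every sample point, so the training loss cannot separate them.

\sin\!\big(2\pi (f + k)\, n\big) = \sin\!\big(2\pi f n + 2\pi k n\big) = \sin\!\big(2\pi f n\big), \qquad n, k \in \mathbb{Z}

A network fitted from such samples may render either candidate when queried on a finer grid; masking the out-of-band entries of the positional encoding γ(y), as the Figure 21 caption describes, removes the ambiguity.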
read the original abstract

Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NOVA, a world modeling framework that represents system states as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This representation is analytically rendered to eliminate decoder bottlenecks, yielding compactness, portability, and zero-shot super-resolution. NOVA supports distillation into a context-dependent video generator via action-matching and, without auxiliary or adversarial losses, achieves unsupervised disentanglement of structural components such as background, foreground, and inter-frame motion. The approach is validated on multiple challenging datasets, delivering strong controllable forecasting at approximately 40M parameters on a single consumer GPU.

Significance. If the empirical claims hold, the work offers a potentially significant advance in efficient and interpretable world models by shifting state representation to INR weight space. The analytical rendering removes a common computational bottleneck, while the emergent disentanglement without extra objectives could enable more editable and customizable video prediction systems. The modest resource footprint further supports practical deployment in autonomous intelligence pipelines.

major comments (2)
  1. [§4] The central claim of reliable controllable forecasting and clean unsupervised disentanglement rests on the INR weight-space representation capturing video dynamics across datasets. The manuscript should include explicit quantitative metrics (e.g., prediction error, disentanglement scores) and ablations in the experiments section demonstrating that these properties arise from the INR formulation rather than dataset-specific artifacts or implicit regularization.
  2. [§3.3] The distillation into a video generator via action-matching is presented as straightforward, yet the manuscript must clarify whether this step preserves the claimed disentanglement properties or introduces trade-offs; a direct comparison of editing fidelity before and after distillation would strengthen the portability argument.
minor comments (2)
  1. [§3.1] Notation for the INR coordinate inputs and weight parameterization should be defined more explicitly in the methods to aid reproducibility.
  2. [Abstract] The abstract and introduction would benefit from naming the specific datasets used and citing the exact performance numbers rather than qualitative descriptors such as 'strong'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments, which have helped us strengthen the manuscript. We address each major comment below and have revised the paper accordingly.

read point-by-point responses
  1. Referee: [§4] The central claim of reliable controllable forecasting and clean unsupervised disentanglement rests on the INR weight-space representation capturing video dynamics across datasets. The manuscript should include explicit quantitative metrics (e.g., prediction error, disentanglement scores) and ablations in the experiments section demonstrating that these properties arise from the INR formulation rather than dataset-specific artifacts or implicit regularization.

    Authors: We agree that additional quantitative metrics and targeted ablations would further substantiate the claims. In the revised manuscript we have expanded §4 to report explicit prediction error metrics (PSNR, SSIM, LPIPS) for controllable forecasting across all datasets and introduced a disentanglement score that quantifies independent editability (background/foreground/motion) via optical-flow consistency and foreground segmentation overlap. We also include an ablation that replaces the INR weight-space state representation with a conventional latent vector of matched dimensionality while holding all other components fixed. The results show clear degradation in both forecasting accuracy and disentanglement quality, indicating that the observed properties are tied to the INR formulation rather than dataset artifacts or implicit regularization. revision: yes

  2. Referee: [§3.3] The distillation into a video generator via action-matching is presented as straightforward, yet the manuscript must clarify whether this step preserves the claimed disentanglement properties or introduces trade-offs; a direct comparison of editing fidelity before and after distillation would strengthen the portability argument.

    Authors: We thank the referee for this suggestion. The revised manuscript now includes a dedicated comparison in the experiments section that directly measures editing fidelity before and after distillation. Using the same set of structural edits (background change, foreground motion alteration, etc.), we evaluate both the original INR-based NOVA model and the distilled video generator with perceptual metrics (LPIPS, FID) and consistency checks (optical flow preservation). The results show that the core disentanglement properties are largely retained after distillation, with only modest trade-offs in fine-grained control attributable to the generative approximation. This addition clarifies the portability of the framework while acknowledging the small cost of distillation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents NOVA as a novel construction: system states are stored directly as weights/biases of an auxiliary INR, which is then analytically rendered to bypass decoders. No equations, derivations, or load-bearing steps are visible that reduce this representation, the claimed disentanglement, or the forecasting performance to fitted quantities defined by the same claims or to self-citations. The abstract and description frame the approach as an independent architectural choice whose benefits (compactness, zero-shot super-resolution, unsupervised structural separation) are asserted as empirical outcomes rather than tautological re-statements of inputs. No self-definitional loops, fitted-input predictions, or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that coordinate-based INRs can serve as compact, renderable, and disentangleable state representations for video dynamics. No free parameters or invented entities beyond the NOVA framework itself are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Coordinate-based implicit neural representations can faithfully encode and render dynamic scene states from their weights alone.
    Invoked when the paper states that the system state is represented as INR weights and biases that are analytically rendered.
invented entities (1)
  • NOVA framework · no independent evidence
    purpose: World modeling via INR weight-space states with built-in structural disentanglement
    Newly proposed architecture and training procedure described in the abstract.

pith-pipeline@v0.9.0 · 5507 in / 1391 out tokens · 41023 ms · 2026-05-11T01:42:57.085892+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 6 internal anchors

  1. [1]

    Lamo: A latent motion world model for long-horizon prediction

    Azwar Abdulsalam, Christopher Hoang, and Mengye Ren. Lamo: A latent motion world model for long-horizon prediction. In ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling.

  2. [2]

    NAF: Zero-shot feature upsampling via neighborhood attention filtering

    Loick Chambon, Paul Couairon, Eloi Zablocki, Alexandre Boulch, Nicolas Thome, and Matthieu Cord. NAF: Zero-shot feature upsampling via neighborhood attention filtering. arXiv preprint arXiv:2511.18452.

  3. [3]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

  4. [4]

    From data to functa: Your data point is a function and you can treat it like one

    Emilien Dupont, Hyunjik Kim, SM Eslami, Danilo Rezende, and Dan Rosenbaum. From data to functa: Your data point is a function and you can treat it like one. arXiv preprint arXiv:2201.12204, 2022.

  5. [5]

    Where do we stand with implicit neural representations? A technical and performance survey

    Amer Essakine, Yanqi Cheng, Chun-Wun Cheng, Lipei Zhang, Zhongying Deng, Lei Zhu, Carola-Bibiane Schönlieb, and Angelica I Aviles-Rivero. Where do we stand with implicit neural representations? A technical and performance survey. arXiv preprint arXiv:2411.03688.

  6. [6]

    AdaWorld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. AdaWorld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025.

  7. [7]

    Learning latent action world models in the wild

    Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, and Michael Rabbat. Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230, 2026.

  8. [8]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

  9. [9]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122.

  10. [10]

    Pre-trained video generative models as world simulators

    Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, and Ling Pan. Pre-trained video generative models as world simulators. arXiv preprint arXiv:2502.07825.

  11. [11]

    The ERA5 global reanalysis

    Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049.

  12. [12]

    How far is video generation from world model: A physical law perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385.

  13. [13]

    Equinox: neural networks in JAX via callable PyTrees and filtered transformations

    Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021.

  14. [14]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  15. [15]

    Weightflow: Learning stochastic dynamics via evolving weight of neural network

    Ruikun Li, Jiazhen Liu, Huandong Wang, Qingmin Liao, and Yong Li. Weightflow: Learning stochastic dynamics via evolving weight of neural network. arXiv preprint arXiv:2508.00451, 2025a.

  16. [16]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. https://arxiv.org/pdf/2603.19312v1.

  17. [17]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588, 2022.

  18. [18]

    Universal differential equations for scientific machine learning

    Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, Ali Ramadhan, and Alan Edelman. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020.

  19. [19]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. arXiv preprint arXiv:2312.10812, 2023.

  20. [20]

    Towards scalable and versatile weight space learning

    Konstantin Schürholt, Michael W Mahoney, and Damian Borth. Towards scalable and versatile weight space learning. arXiv preprint arXiv:2406.09997.

  21. [21]

    Implicit neural representations with periodic activation functions

    Vincent Sitzmann, Julien NP Martel, Alexander W Bergman, David B Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. arXiv preprint arXiv:2006.09661.

  22. [22]

    Calculation of the Wasserstein distance between probability distributions on the line

    SS Vallender. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 18(4):784–786.

  23. [23]

    Chain of World: World model thinking in latent motion

    Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, and Baorui Ma. Chain of World: World model thinking in latent motion. arXiv preprint arXiv:2603.03195, 2026.

  24. [24]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758.

  25. [25]

    What do latent action models actually learn?

    Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, and Jiang Bian. What do latent action models actually learn? arXiv preprint arXiv:2506.15691, 2025a.

  26. [26]

    Supplementary material contents

    Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement (Supplementary Material): A. Related Work; B. Methodological Details; B.1 Motivation: INRs and the Decoder Bottleneck; B.2 Architecture & Hyperparameters; B.3 Context-Conditioned Video Generation…

  27. [27]

    LAPO [Schmidt and Jiang, 2023] and LAPA [Ye et al., 2024] learn inverse dynamics models that extract latent actions bridging consecutive states

    infer these from consecutive observations. LAPO [Schmidt and Jiang, 2023] and LAPA [Ye et al., 2024] learn inverse dynamics models that extract latent actions bridging consecutive states. Moto [Chen et al., 2025] and AdaWorld [Gao et al., 2025] incorporate such models into robot learning pipelines. Garrido et al…

  28. [28]

    For deployment without future frames, RSSMs [Hafner et al., 2020], IRIS [Micheli et al., 2022], and CoWVLA [Yang et al., 2026] use autoregressive models to generate actions

    demonstrate that continuous latent actions, although sparse and noisy, suffice for in-the-wild action learning, whereas discrete codebooks struggle. For deployment without future frames, RSSMs [Hafner et al., 2020], IRIS [Micheli et al., 2022], and CoWVLA [Yang et al., 2026] use autoregressive models to generate actions. Our GCM follows this principle us…

  29. [29]

    In the 3D domain, 3D Gaussian Splatting [Kerbl et al., 2023] and work such as GWM [Lu et al., 2025] exploit structured spatial representations for world modelling

    introduce the term “functa”, showing that meta-learned [Zintgraf et al., 2019; Nzoyem et al., 2025] INR weight vectors can serve as data representations for downstream tasks. In the 3D domain, 3D Gaussian Splatting [Kerbl et al., 2023] and work such as GWM [Lu et al., 2025] exploit structured spatial representations for world modelling. Our work extends thi…

  30. [30]

    We set the values corresponding to ut as 0 in the input sequence

    that processes a sequence of concatenated {zτ, uτ}, 1 ≤ τ ≤ t, pairs with learned positional embeddings. We set the values corresponding to ut as 0 in the input sequence. Note that the output layer is included in the layer count, meaning that 6 layers corresponds to a depth of 5, following Equinox's convention [Kidger and Garcia, 2021]. (e) Forward Dynamics…

  31. [31]

    (1) Moving MNIST [Srivastava et al., 2015] assesses deterministic spatial-temporal forecasting and boundary sharpness over time; sequences (T=…

  32. [32]

    We use 8,000 training and 2,000 testing sequences

    show two digits bouncing on a 64×64 grayscale canvas. We use 8,000 training and 2,000 testing sequences. (2) PhyWorld 30K [Kang et al., 2024] evaluates modelling of complex, non-linear multi-body physical interactions. Rendered at 128×128 RGB resolution, we use exactly 26,066 training sequences and 1,635 testing sequences containing both in-distribution and…

  33. [33]

    “Vector” describes the dimension of one-dimensional latent vectors, while “Submodel…

    We implement the Standard WM in similar conditions to NOVA, even matching its batch sizes. As for the encoder-decoder-free LAPO [Schmidt and Jiang, 2023], we adapt the code from its official repository, readily available and downloaded from https://github.com/schmidtdominik/LAPO. Table 5: Parameter count across datasets. “Vector” describes the dimension…

  34. [34]

    As seen in Figure 11, we observe that z̄ naturally absorbs this, indicating that the per-frame offsets zt have near-zero contribution to background pixels

    Background absorption by z̄. In all Moving MNIST sequences, the background is uniformly dark. As seen in Figure 11, we observe that z̄ naturally absorbs this, indicating that the per-frame offsets zt have near-zero contribution to background pixels. The utility of z̄ extends beyond structural disen…

  35. [35]

    prototype

    By framing the temporal predictions as small residuals zt relative to z̄, analogous to residual learning [He et al., 2016], we stabilise optimisation and prevent the network from collapsing into local minima. Consequently, even in complex scenes where background disentanglement is challenging, the residual anchoring provided by z̄ may remain beneficial for…

  36. [36]

    NOVA produces sharp predictions without checkerboard artefacts, a common failure mode in transposed-CNN decoders

    Figure 14 shows qualitative video forecasting results on Moving MNIST, conditioned on 10 initial frames and rolling out for 10 predicted frames. NOVA produces sharp predictions without checkerboard artefacts, a common failure mode in transposed-CNN decoders. Forecasting over longer horizons (Tinf = 1000): long-horizon generation is a potent test of identit…

  37. [37]

    We report the median, min, and max over T = 1000 frames and over two sequences {54, 57}, limited as such for computational reasons

    Table 7: Long-horizon identity-consistency metrics. We report the median, min, and max over T = 1000 frames and over two sequences {54, 57}, limited as such for computational reasons. W1 ↓ (median/min/max): Standard WM 0.0780/0.0462/0.1308; LAPO 0.0110/0.0037/0.0847; NOVA 0.1222/0.0258/0.6138. JSD ↓: Standard WM 0.4255/0.3313/0.4966; LAPO 0.0771/0.0703/0…

  38. [38]

    from in-band ones at the training sample locations

    Injected alien collision dynamics (actions) are the same, leading to unphysical behaviours. … from in-band ones at the training sample locations. The loss cannot differentiate them, so the corresponding weights are optimised against noise and produce artefacts at super-resolution. Theorem 1 (Nyquist–Shannon Sampling Theorem, informal): If a continuous fun…

  39. [39]

    Figure 23: Latent action dimensions on MiniGrid. (Left) Discrete: raw actions and their quantised versions

    Figure 22: Zero-shot super-resolution on WeatherBench without frequency masking. Compared to Figure 7, horizontal bars are visible, indicating spectral aliasing along the vertical axis. Figure 23: Latent action dimensions on MiniGrid…

  40. [40]

    Both frameworks view the latent state of the dynamic system not as a standard feature vector, but as the weights and biases of an INR

    shares several similarities with contemporary architectures, most notably Weight-space Adaptive Recurrent Prediction (WARP) [Nzoyem et al., 2026]. Both frameworks view the latent state of the dynamic system not as a standard feature vector, but as the weights and biases of an INR. In WARP, this is governed by a continuous linear recurrence defined as zt…

  41. [41]

    go to x, y

    Following the processing of the initial 3 frames and autoregressive generation after those, we intervened at t = 6 by replacing the naturally evolved state zt with the latent encoding of an alien frame. However, the intervention sequence (Figure 27b) demonstrates a catastrophic failure, as WARP allows the alien injection to fundamentally corrupt the forwa…