
arxiv: 2605.18743 · v1 · pith:H4PHB7T6 · submitted 2026-05-18 · 💻 cs.AI

Actionable World Representation

Pith reviewed 2026-05-20 09:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords world models · object representation · state manifold · neural architecture · point clouds · RGB-D video · digital twin · actionable objects

The pith

WorldString learns the state manifold of real-world objects directly from point clouds or RGB-D video to serve as a digital twin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes WorldString as a neural architecture that models how real objects change their states over time. It learns this manifold straight from raw sensor inputs like point clouds or depth videos, without needing extra labels or custom engineering for each object. This creates a unified representation that can act as a building block for larger physical world models. A sympathetic reader would care because it aims to give AI systems a way to handle actionable objects the way language models handle text.

Core claim

WorldString is a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models. Its fully differentiable structure enables future integration with policy learning and neural dynamics.

What carries the argument

WorldString, a fully differentiable neural architecture that recovers a low-dimensional state manifold for objects from raw sensor streams.
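
For intuition only, the idea of a low-dimensional state manifold can be sketched in a few lines of Python. This is not WorldString: the abstract gives no architecture, so everything here is an assumption made for illustration. The object is a hypothetical 2D "door" hinged at the origin, `decode` stands in for a learned decoder, Chamfer distance stands in for an unspecified reconstruction loss, and `recover_state` replaces a learned encoder with a brute-force scan.

```python
import math

def decode(theta, n=20):
    """Hypothetical decoder: a unit-length 2D door hinged at the origin.
    A single state variable theta (radians) determines every point."""
    return [(k / (n - 1) * math.cos(theta), k / (n - 1) * math.sin(theta))
            for k in range(n)]

def chamfer(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour
    distance from a to b, plus the same from b to a."""
    def one_way(src, dst):
        return sum(min((x - u) ** 2 + (y - v) ** 2 for u, v in dst)
                   for x, y in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def recover_state(observed, steps=181):
    """Scan candidate states in [0, pi/2] and keep the best fit.
    A stand-in for a learned encoder, chosen only for simplicity."""
    candidates = [i * (math.pi / 2) / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda t: chamfer(decode(t), observed))

# An observed point cloud of the door opened to 30 degrees.
observed = decode(math.radians(30))
theta_hat = recover_state(observed)
print(round(math.degrees(theta_hat), 2))
```

The point of the toy is that 20 observed points are fully explained by one number. The paper's claim is that an analogous compact state can be learned for real objects from raw sensor data, which the abstract asserts but does not demonstrate.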

If this is right

  • The model can serve as a foundational component inside larger physical world models.
  • Its differentiability allows direct connection to policy learning and neural dynamics modules.
  • Objects become digital twins that encode intrinsic properties and state changes from sensor data alone.
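
What differentiability buys can be illustrated with a minimal sketch, again under loud assumptions: the decoder below is a hypothetical hinged door rather than anything from the paper, the gradient is derived by hand instead of by an autodiff framework, and `plan_state` is an invented helper name. The sketch shows a downstream module following gradients through the decoder to find the state that reaches a goal.

```python
import math

def tip_position(theta):
    """Hypothetical differentiable decoder output: the tip of a
    unit-length door hinged at the origin."""
    return math.cos(theta), math.sin(theta)

def loss_and_grad(theta, goal):
    """Squared distance from the tip to the goal, and d(loss)/d(theta)."""
    x, y = tip_position(theta)
    dx, dy = x - goal[0], y - goal[1]
    loss = dx * dx + dy * dy
    # Chain rule: d(tip)/d(theta) = (-sin(theta), cos(theta)).
    grad = 2.0 * (dx * -math.sin(theta) + dy * math.cos(theta))
    return loss, grad

def plan_state(goal, theta=0.0, lr=0.2, iters=200):
    """Plain gradient descent on the state variable (illustrative helper)."""
    for _ in range(iters):
        _, grad = loss_and_grad(theta, goal)
        theta -= lr * grad
    return theta

# Goal: the tip position when the door is open to 60 degrees.
goal = tip_position(math.radians(60))
theta_star = plan_state(goal)
print(round(math.degrees(theta_star), 3))
```

In a real system the hand-written derivative would be replaced by automatic differentiation through the learned model, and the optimized quantity would more plausibly be an action or policy parameter than the state itself.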

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This style of representation might improve long-horizon planning in robotics by giving agents explicit access to object state manifolds.
  • If the manifold is truly low-dimensional and general, similar architectures could be applied to non-rigid or articulated objects without redesign.

Load-bearing premise

Real-world objects have a learnable low-dimensional state manifold that can be recovered in a unified way from raw sensor streams without extra supervision or object-specific design.

What would settle it

A test in which WorldString fails to produce consistent state predictions for novel objects or actions outside its training distribution would show the manifold is not recoverable in the claimed unified manner.
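
The shape of such a test can be sketched as a comparison of reconstruction error on seen and unseen states. The code below is a hypothetical harness, not an experiment from the paper: the object is a toy hinged door, and `memorising_model` is a deliberately weak stand-in that returns the closest training cloud, so that the failure mode is visible.

```python
import math

def door_cloud(theta, n=10):
    """Toy object: points along a unit-length door opened to angle theta."""
    return [(k / (n - 1) * math.cos(theta), k / (n - 1) * math.sin(theta))
            for k in range(n)]

def chamfer(a, b):
    """Symmetric Chamfer distance between two 2D point sets."""
    def one_way(src, dst):
        return sum(min((x - u) ** 2 + (y - v) ** 2 for u, v in dst)
                   for x, y in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def memorising_model(train_thetas):
    """Stand-in for a trained model (NOT WorldString): it can only
    reproduce clouds it saw during training."""
    bank = [door_cloud(t) for t in train_thetas]
    def predict(observed):
        return min(bank, key=lambda cloud: chamfer(cloud, observed))
    return predict

def mean_error(predict, thetas):
    """Average Chamfer error of the model's reconstruction over states."""
    clouds = [door_cloud(t) for t in thetas]
    return sum(chamfer(predict(c), c) for c in clouds) / len(clouds)

train = [math.radians(d) for d in range(0, 46, 5)]      # 0 to 45 degrees
held_out = [math.radians(d) for d in range(60, 91, 5)]  # 60 to 90 degrees, unseen

predict = memorising_model(train)
in_dist_error = mean_error(predict, train)
ood_error = mean_error(predict, held_out)
print(in_dist_error, ood_error)
```

A model that had recovered the underlying manifold would keep the held-out error close to the in-distribution error. A large gap, as the memorising stand-in produces here, is the kind of evidence that would count against the unified-recovery claim.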

Original abstract

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with a emphasis on modeling the physical world. Within the scope of physical world model, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Sweetly, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes WorldString, a neural architecture that models the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. It is positioned as a versatile digital twin and foundational building block for physical world models, with a fully differentiable structure intended to support future integration with policy learning and neural dynamics.

Significance. If the architecture can recover a unified, low-dimensional, actionable state manifold from raw sensor data without supervision or object-specific engineering, it would address a gap between video-generation approaches and dynamic scene reconstruction by providing an explicit, general object representation. This could serve as a reusable primitive for physical world models in robotics and embodied AI. The manuscript, however, offers only a high-level proposal with no architecture details, loss formulation, or empirical results, so the significance remains speculative.

major comments (2)
  1. [Abstract] The central claim that WorldString 'models the state manifold of real-world objects ... in a unified, principled way' is load-bearing yet unsupported; the manuscript provides no architecture diagram, loss function, or training procedure to demonstrate how a single network extracts intrinsic states across rigid, articulated, and deformable objects from raw point clouds or RGB-D streams.
  2. [Abstract] The assertion that the representation is 'actionable' and serves as a 'digital twin' for downstream policy learning and neural dynamics rests on the untested assumption that the learned manifold captures causally relevant intrinsic properties rather than superficial geometric or appearance correlations; no ablation or generalization experiment is described to substantiate this.
minor comments (1)
  1. [Abstract] Final sentence: The adverb 'Sweetly' is informal and imprecise; replace with a clearer term such as 'Importantly' or 'Advantageously'.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and describe the changes we will make in revision.

Point-by-point responses
  1. Referee: [Abstract] The central claim that WorldString 'models the state manifold of real-world objects ... in a unified, principled way' is load-bearing yet unsupported; the manuscript provides no architecture diagram, loss function, or training procedure to demonstrate how a single network extracts intrinsic states across rigid, articulated, and deformable objects from raw point clouds or RGB-D streams.

    Authors: We agree that the current manuscript presents WorldString at a conceptual level and does not yet supply the requested technical details. In the revised version we will add an architecture diagram, the explicit loss formulation, and a description of the training procedure. These additions will show how a single network is intended to recover intrinsic states from raw sensor data across rigid, articulated, and deformable objects. revision: yes

  2. Referee: [Abstract] The assertion that the representation is 'actionable' and serves as a 'digital twin' for downstream policy learning and neural dynamics rests on the untested assumption that the learned manifold captures causally relevant intrinsic properties rather than superficial geometric or appearance correlations; no ablation or generalization experiment is described to substantiate this.

    Authors: We acknowledge that the manuscript currently offers no empirical results or ablations to support the actionability claim. We will revise the abstract and introduction to clarify that actionability is currently a design property arising from the fully differentiable state-manifold representation, rather than a demonstrated causal property. We will also add a section that outlines concrete validation experiments, ablations, and generalization tests planned for follow-up work. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper proposes WorldString as a neural architecture that models the state manifold of objects directly from point clouds or RGB-D streams to serve as a digital twin for physical world models. The provided text contains no equations, loss formulations, parameter-fitting procedures, or derivation steps that could be inspected for reduction to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The central premise is an assumption about the existence of a learnable low-dimensional manifold, presented as a hypothesis rather than a result derived from prior self-referential content. The architecture is described at a high level without any mathematical chain that collapses to its own fitted values or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review performed on abstract only; no specific free parameters, axioms, or invented entities beyond the named model are described in the provided text.

invented entities (1)
  • WorldString · no independent evidence
    purpose: neural architecture for modeling object state manifolds
    New model name and claimed capability introduced in the abstract.

pith-pipeline@v0.9.0 · 5708 in / 950 out tokens · 21078 ms · 2026-05-20T09:43:26.482124+00:00 · methodology


