arXiv: 2510.04978 · v5 · submitted 2025-10-06 · 💻 cs.AI

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Pith reviewed 2026-05-18 09:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords physical AI · embodied intelligence · world models · physics-grounded methods · symbolic reasoning · generative models · AI survey

The pith

This survey argues that aligning perception, reasoning, modeling and interaction with physical laws lets AI move beyond pattern matching to genuine physical understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys efforts to integrate physical principles into AI systems that handle perception, reasoning, modeling and interaction. It draws a clear line between theoretical physics reasoning and the applied physical understanding needed for real-world tasks. The authors review recent progress in symbolic reasoning, embodied systems and generative models, then make the case for systems that combine physical principles with embodied reasoning processes. If this integration succeeds, AI would no longer rely mainly on statistical patterns but could explain physical phenomena and predict future states. The stated goal is safer, more generalizable and interpretable world models for embodied intelligence.

Core claim

Intelligent systems that ground learning in both physical principles and embodied reasoning processes can transcend pattern recognition toward genuine understanding of physical laws, enabling next-generation world models capable of explaining physical phenomena and predicting future states.

What carries the argument

A unified bridging framework that connects structured symbolic reasoning, embodied systems and generative models through physics-grounded methods to produce applied physical understanding.

If this is right

  • AI gains improved real-world comprehension by grounding outputs in physical laws.
  • World models become able to explain observed phenomena and forecast future physical states.
  • Systems advance toward greater safety, generalization and interpretability in embodied tasks.
  • Perception, reasoning, modeling and interaction become mutually reinforcing rather than separate tracks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotics applications could gain more reliable prediction of object interactions and dynamics.
  • Training data requirements might decrease if physical constraints replace some statistical learning.
  • The survey's distinctions between theoretical and applied understanding could guide evaluation benchmarks in cognitive robotics.

Load-bearing premise

Recent advances in physics-grounded methods across symbolic reasoning, embodied systems and generative models can be brought together in one framework that yields genuine physical understanding rather than just stronger pattern matching.

What would settle it

Build and test a unified physical AI system on novel physical scenarios; if its generalization and explanatory power remain no better than current pattern-based models, the central claim is falsified.
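
The shape of such a test can be sketched on a toy problem. The example below is a minimal illustration, not the survey's protocol or any benchmark it cites: every function name, the free-fall setting, and the assumption that the release height is known are hypothetical choices made here. It trains a physics-grounded predictor (one fitted parameter, g) and a pattern-matching baseline (nearest-neighbor lookup) on early fall times, then scores both on later, unseen times.

```python
"""Toy falsification test: compare a physics-grounded predictor with a
pattern-matching baseline on conditions outside the training range.
All names and settings are illustrative assumptions, not from the paper."""

G_TRUE = 9.81   # m/s^2, used only to simulate observations
H0 = 100.0      # m, release height (assumed known to both models)


def simulate_height(t):
    """Ground-truth free fall without air resistance."""
    return H0 - 0.5 * G_TRUE * t * t


def fit_gravity(times, heights):
    """Least-squares estimate of g under the model h = H0 - 0.5 * g * t^2."""
    xs = [0.5 * t * t for t in times]
    ys = [H0 - h for h in heights]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def nearest_neighbor_predict(times, heights, t_query):
    """Pattern-matching baseline: return the height seen at the closest time."""
    idx = min(range(len(times)), key=lambda i: abs(times[i] - t_query))
    return heights[idx]


def mean_abs_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / len(targets)


def run_experiment():
    # Training scenarios: fall times from 0.1 s to 1.0 s.
    train_t = [i / 10 for i in range(1, 11)]
    train_h = [simulate_height(t) for t in train_t]

    # Novel scenarios: fall times from 2.0 s to 3.0 s, never seen in training.
    test_t = [2.0 + i / 10 for i in range(11)]
    test_h = [simulate_height(t) for t in test_t]

    g_hat = fit_gravity(train_t, train_h)
    physics_pred = [H0 - 0.5 * g_hat * t * t for t in test_t]
    pattern_pred = [nearest_neighbor_predict(train_t, train_h, t) for t in test_t]

    return {
        "g_hat": g_hat,
        "physics_mae": mean_abs_error(physics_pred, test_h),
        "pattern_mae": mean_abs_error(pattern_pred, test_h),
    }


if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(f"{name}: {value:.4f}")
```

In this toy the physics-grounded model is handed the correct functional form, so its advantage on novel times is built in. The falsification criterion above concerns the harder case in which a real system must earn that advantage on scenarios where the governing structure is not supplied.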

Figures

Figures reproduced from arXiv: 2510.04978 by Bin Dong, Bingqian Lin, Hang Xu, Hanhui Li, Jianhua Han, Jixi He, Kun Xiang, Lijing Luo, Ruizhe Zhou, Terry Jingchen Zhang, Xiaodan Liang, Xiuwei Chen, Yinya Huang, Youpeng Wen, Yueling Tang, Zirong Liu.

Figure 1: Overview of four physical understanding capabilities of current AI systems.
Figure 2: Timeline of the development of Physical AI
Figure 3: The proposed taxonomy of AI systems for physics understanding capabilities, organized into four …
Figure 4: Error pattern analysis of closed-source models in multimodal physical reasoning.
Figure 5: Multi-agent physics reasoning system across …
Figure 6: Video Generations by GPT4Motion (Figure courtesy of [163]).

TABLE 3: Performance of Video Generation Models on representative physical and world modeling benchmarks.

Model             PhysicsIQ [164] (↑)   PhyGen [165] (↑)   VideoPhy [166] (↑)   WorldModel Bench [167] (↑)
Sora [168]        0.10                  0.44               0.28                 6.11
Pika [169]        0.13                  0.44               0.29                 –
CogVideoX [170]   –                     0.45               0.49                 7.31
LaVie [171]       –                     0.36               0.41                 –
Kling [172]       –                     0.49               –                    8.82

Modern video…

Figure 8: Overview of NavCoT (Figure used courtesy …
Figure 9: Overall framework of DriveDreamer (Figure …
Original abstract

The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI's real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey on Physical AI that reviews the integration of physical laws into AI systems. It distinguishes between theoretical physics reasoning and applied physical understanding, examines physics-grounded methods in symbolic reasoning, embodied systems, and generative models, and advocates for grounding learning in physical principles and embodied reasoning to achieve genuine understanding beyond pattern recognition, envisioning advanced world models for explaining and predicting physical phenomena.

Significance. If the advocated synthesis holds, this survey could significantly influence the field by promoting more interpretable and generalizable AI systems that incorporate physical understanding, potentially leading to safer and more robust embodied intelligence and world models. The continuous resource at the GitHub link adds value for the community.

major comments (2)
  1. [Abstract] Abstract: The positioning of the work as providing a 'unified bridging framework' for transcending pattern recognition is load-bearing for the central claim, yet the review of separate trajectories across symbolic reasoning, embodied systems, and generative models does not include a concrete integration mechanism or formal definition of 'genuine understanding' versus improved statistical correlation.
  2. [Synthesis sections] Synthesis/Advocacy sections: The distinction between theoretical and applied physical understanding is presented without external benchmarks or comparative evaluations on physical prediction tasks, which undermines the assertion that reviewed methods achieve transcendence beyond pattern matching.
minor comments (2)
  1. Update the GitHub resource link with the latest references to maintain currency in this rapidly evolving field.
  2. Ensure figure captions and tables clearly indicate the scope of reviewed methods to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey manuscript. We address each major comment point by point below, clarifying the scope of a survey paper while indicating specific revisions that will strengthen the presentation without altering its core contribution as a synthesis of the literature.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The positioning of the work as providing a 'unified bridging framework' for transcending pattern recognition is load-bearing for the central claim, yet the review of separate trajectories across symbolic reasoning, embodied systems, and generative models does not include a concrete integration mechanism or formal definition of 'genuine understanding' versus improved statistical correlation.

    Authors: As a survey, the manuscript does not introduce a new technical integration mechanism; the 'unified bridging framework' is intended as the organizational taxonomy and cross-domain analysis that connects the reviewed trajectories through shared physical principles. We will revise the abstract to describe this more precisely as a conceptual synthesis and organizational structure. We will also add a short subsection early in the introduction that offers a working definition of 'genuine understanding' in physical AI, drawing on distinctions from the literature such as causal intervention, counterfactual reasoning, and systematic generalization on physical tasks, to differentiate it from improved statistical correlation. revision: partial

  2. Referee: [Synthesis sections] Synthesis/Advocacy sections: The distinction between theoretical and applied physical understanding is presented without external benchmarks or comparative evaluations on physical prediction tasks, which undermines the assertion that reviewed methods achieve transcendence beyond pattern matching.

    Authors: The referee is correct that the current draft presents the distinction conceptually without new empirical comparisons. Because this is a survey, we do not conduct fresh experiments. In revision we will expand the synthesis sections to reference existing physical prediction benchmarks and datasets from the literature (e.g., those appearing in physics-informed neural network evaluations and embodied reasoning challenges), summarize performance trends reported in the cited works, and explicitly note the limitations of a review format in providing direct head-to-head evaluations. This will make the scope and evidential basis clearer while preserving the survey's role. revision: yes

Circularity Check

0 steps flagged

Survey with no derivations or self-referential predictions; claims rest on external citations

Full rationale

This is a literature survey without equations, fitted parameters, or original derivations. The central advocacy for grounding AI in physical principles plus embodied reasoning and for next-generation world models is presented as a synthesis of reviewed advances across symbolic reasoning, embodied systems, and generative models. No load-bearing step reduces by construction to a self-definition, a fitted input renamed as prediction, or a self-citation chain; distinctions between theoretical and applied understanding are offered as organizational framing rather than a derived result. The paper therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the premise that separate development trajectories of perception and symbolic reasoning require a unified bridging framework; no free parameters or invented entities are introduced as this is a review rather than a modeling paper.

axioms (1)
  • domain assumption: Physics-grounded methods can enhance AI's real-world comprehension beyond pattern recognition.
    Invoked in the abstract when advocating for systems that ground learning in physical principles and embodied reasoning.

pith-pipeline@v0.9.0 · 5730 in / 1267 out tokens · 24953 ms · 2026-05-18T09:53:36.279346+00:00 · methodology
