arxiv: 2510.04978 · v5 · submitted 2025-10-06 · 💻 cs.AI

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Kun Xiang, Terry Jingchen Zhang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Lijing Luo, Youpeng Wen, Xiuwei Chen, Bingqian Lin, Jianhua Han, Hang Xu, Hanhui Li, Bin Dong, Xiaodan Liang

Pith reviewed 2026-05-18 09:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords physical AI, embodied intelligence, world models, physics-grounded methods, symbolic reasoning, generative models, AI survey

The pith

This survey argues that aligning perception, reasoning, modeling and interaction with physical laws lets AI move beyond pattern matching to genuine physical understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys efforts to integrate physical principles into AI systems that handle perception, reasoning, modeling and interaction. It draws a clear line between theoretical physics reasoning and the applied physical understanding needed for real-world tasks. The authors review recent progress in symbolic reasoning, embodied systems and generative models, then make the case for systems that combine physical principles with embodied reasoning processes. If this integration succeeds, AI would no longer rely mainly on statistical patterns but could explain physical phenomena and predict future states. The stated goal is safer, more generalizable and interpretable world models for embodied intelligence.

Core claim

Intelligent systems that ground learning in both physical principles and embodied reasoning processes can transcend pattern recognition toward genuine understanding of physical laws, enabling next-generation world models capable of explaining physical phenomena and predicting future states.

What carries the argument

A unified bridging framework that connects structured symbolic reasoning, embodied systems and generative models through physics-grounded methods to produce applied physical understanding.

If this is right

AI gains improved real-world comprehension by grounding outputs in physical laws.
World models become able to explain observed phenomena and forecast future physical states.
Systems advance toward greater safety, generalization and interpretability in embodied tasks.
Perception, reasoning, modeling and interaction become mutually reinforcing rather than separate tracks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robotics applications could gain more reliable prediction of object interactions and dynamics.
Training data requirements might decrease if physical constraints replace some statistical learning.
The survey's distinctions between theoretical and applied understanding could guide evaluation benchmarks in cognitive robotics.

Load-bearing premise

Recent advances in physics-grounded methods across symbolic reasoning, embodied systems and generative models can be brought together in one framework that yields genuine physical understanding rather than just stronger pattern matching.

What would settle it

Build and test a unified physical AI system on novel physical scenarios; if its generalization and explanatory power remain no better than current pattern-based models, the central claim is falsified.
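As a toy illustration of the kind of test described above (not a procedure taken from the paper), the sketch below compares a pattern-based predictor with a physics-grounded one on time points that lie outside the training range. The projectile setup, the function names, and the nearest-neighbor baseline are all assumptions chosen for illustration.

```python
# Illustrative sketch only: a toy "novel scenario" test, not the survey's method.
# All names and the projectile setup below are hypothetical.

G = 9.81  # gravitational acceleration in m/s^2


def true_height(t, v0=20.0):
    """Ground truth for a vertical throw: y(t) = v0*t - 0.5*G*t^2."""
    return v0 * t - 0.5 * G * t * t


def nearest_neighbor_predict(train, t):
    """Pattern-based baseline: return the height at the closest training time."""
    return min(train, key=lambda point: abs(point[0] - t))[1]


def fit_physics(train):
    """Physics-grounded model: assume y = a*t + b*t^2 and fit a, b by least squares."""
    s_t2 = sum(t ** 2 for t, _ in train)
    s_t3 = sum(t ** 3 for t, _ in train)
    s_t4 = sum(t ** 4 for t, _ in train)
    s_ty = sum(t * y for t, y in train)
    s_t2y = sum(t * t * y for t, y in train)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_t2 * s_t4 - s_t3 ** 2
    a = (s_ty * s_t4 - s_t3 * s_t2y) / det
    b = (s_t2 * s_t2y - s_t3 * s_ty) / det
    return a, b


def evaluate():
    """Return mean absolute errors (pattern-based, physics-grounded) on unseen times."""
    train = [(0.1 * i, true_height(0.1 * i)) for i in range(1, 11)]  # t in (0, 1]
    test_times = [3.0 + 0.25 * i for i in range(9)]                  # t in [3, 5]
    a, b = fit_physics(train)
    nn_err = sum(abs(nearest_neighbor_predict(train, t) - true_height(t))
                 for t in test_times) / len(test_times)
    physics_err = sum(abs(a * t + b * t * t - true_height(t))
                      for t in test_times) / len(test_times)
    return nn_err, physics_err


if __name__ == "__main__":
    nn_err, physics_err = evaluate()
    print(f"pattern-based error: {nn_err:.3f}, physics-grounded error: {physics_err:.3e}")
```

In this toy setting the physics-grounded fit recovers the underlying law and extrapolates, while the memorizing baseline does not. A real evaluation of the survey's claim would need far richer scenarios and stronger pattern-based baselines.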

Figures

Figures reproduced from arXiv: 2510.04978 by Bin Dong, Bingqian Lin, Hang Xu, Hanhui Li, Jianhua Han, Jixi He, Kun Xiang, Lijing Luo, Ruizhe Zhou, Terry Jingchen Zhang, Xiaodan Liang, Xiuwei Chen, Yinya Huang, Youpeng Wen, Yueling Tang, Zirong Liu.

**Figure 2.** Timeline of the development of Physical AI.

**Figure 3.** The proposed taxonomy of AI systems for physics understanding capabilities, organized into four…

**Figure 4.** Error pattern analysis of closed-source models in multimodal physical reasoning.

**Figure 5.** Multi-agent physics reasoning system across…

**Figure 6.** Video generations by GPT4Motion (figure courtesy of [163]).

**Table 3.** Performance of video generation models on representative physical and world modeling benchmarks.

| Model | PhysicsIQ [164] (↑) | PhyGen [165] (↑) | VideoPhy [166] (↑) | WorldModel Bench [167] (↑) |
|---|---|---|---|---|
| Sora [168] | 0.10 | 0.44 | 0.28 | 6.11 |
| Pika [169] | 0.13 | 0.44 | 0.29 | – |
| CogVideoX [170] | – | 0.45 | 0.49 | 7.31 |
| LaVie [171] | – | 0.36 | 0.41 | – |
| Kling [172] | – | 0.49 | – | 8.82 |

Modern video…

**Figure 8.** Overview of NavCoT (Figure used courtesy…

**Figure 9.** Overall framework of DriveDreamer (Figure…

Original abstract

The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI's real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey organizes Physical AI work across domains but stops short of a concrete bridging framework or evidence that methods exceed pattern matching.

Colleague, the main thing to know is that this is a survey paper that pulls together literature on physical AI, distinguishing theoretical reasoning from applied understanding and reviewing methods in symbolic systems, embodied agents, and generative models. It advocates grounding learning in physical principles plus embodied processes to move beyond correlations toward real comprehension of physical laws, with a GitHub repo for updates. That organization is the core contribution and could help someone map the current spread of ideas in embodied AI and world models. The authors do a reasonable job laying out separate trajectories and arguing for their integration, which gives the piece a clear through-line without overclaiming new theorems or data.

The soft spots sit in the execution of that argument. The paper calls for a unified framework that produces genuine understanding, yet it does not sketch any explicit integration mechanism, define the term operationally, or include comparative evaluations showing better physical prediction than standard statistical approaches. Claims rest on narrative synthesis of prior work rather than fresh validation or case studies. This leaves the central vision more aspirational than demonstrated.

The paper is aimed at researchers already working in embodied intelligence or world models who need a structured overview rather than a deep technical dive. A reader new to the subfield might pick up useful categorization and pointers; someone following the literature closely will see mostly extensions of familiar themes.

It deserves peer review. Surveys that clarify boundaries and point to open gaps can still serve the community even when the synthesis stays high-level, and the topic is active enough that a careful referee could push the integration section forward.

Referee Report

2 major / 2 minor

Summary. The manuscript is a survey on Physical AI that reviews the integration of physical laws into AI systems. It distinguishes between theoretical physics reasoning and applied physical understanding, examines physics-grounded methods in symbolic reasoning, embodied systems, and generative models, and advocates for grounding learning in physical principles and embodied reasoning to achieve genuine understanding beyond pattern recognition, envisioning advanced world models for explaining and predicting physical phenomena.

Significance. If the advocated synthesis holds, this survey could significantly influence the field by promoting more interpretable and generalizable AI systems that incorporate physical understanding, potentially leading to safer and more robust embodied intelligence and world models. The continuous resource at the GitHub link adds value for the community.

major comments (2)

[Abstract] Abstract: The positioning of the work as providing a 'unified bridging framework' for transcending pattern recognition is load-bearing for the central claim, yet the review of separate trajectories across symbolic reasoning, embodied systems, and generative models does not include a concrete integration mechanism or formal definition of 'genuine understanding' versus improved statistical correlation.
[Synthesis sections] Synthesis/Advocacy sections: The distinction between theoretical and applied physical understanding is presented without external benchmarks or comparative evaluations on physical prediction tasks, which undermines the assertion that reviewed methods achieve transcendence beyond pattern matching.

minor comments (2)

Update the GitHub resource link with the latest references to maintain currency in this rapidly evolving field.
Ensure figure captions and tables clearly indicate the scope of reviewed methods to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our survey manuscript. We address each major comment point by point below, clarifying the scope of a survey paper while indicating specific revisions that will strengthen the presentation without altering its core contribution as a synthesis of the literature.

Referee: [Abstract] Abstract: The positioning of the work as providing a 'unified bridging framework' for transcending pattern recognition is load-bearing for the central claim, yet the review of separate trajectories across symbolic reasoning, embodied systems, and generative models does not include a concrete integration mechanism or formal definition of 'genuine understanding' versus improved statistical correlation.

Authors: As a survey, the manuscript does not introduce a new technical integration mechanism; the 'unified bridging framework' is intended as the organizational taxonomy and cross-domain analysis that connects the reviewed trajectories through shared physical principles. We will revise the abstract to describe this more precisely as a conceptual synthesis and organizational structure. We will also add a short subsection early in the introduction that offers a working definition of 'genuine understanding' in physical AI, drawing on distinctions from the literature such as causal intervention, counterfactual reasoning, and systematic generalization on physical tasks, to differentiate it from improved statistical correlation.
Revision: partial

Referee: [Synthesis sections] Synthesis/Advocacy sections: The distinction between theoretical and applied physical understanding is presented without external benchmarks or comparative evaluations on physical prediction tasks, which undermines the assertion that reviewed methods achieve transcendence beyond pattern matching.

Authors: The referee is correct that the current draft presents the distinction conceptually without new empirical comparisons. Because this is a survey, we do not conduct fresh experiments. In revision we will expand the synthesis sections to reference existing physical prediction benchmarks and datasets from the literature (e.g., those appearing in physics-informed neural network evaluations and embodied reasoning challenges), summarize performance trends reported in the cited works, and explicitly note the limitations of a review format in providing direct head-to-head evaluations. This will make the scope and evidential basis clearer while preserving the survey's role.
Revision: yes

Circularity Check

0 steps flagged

Survey with no derivations or self-referential predictions; claims rest on external citations

This is a literature survey without equations, fitted parameters, or original derivations. The central advocacy for grounding AI in physical principles plus embodied reasoning and for next-generation world models is presented as a synthesis of reviewed advances across symbolic reasoning, embodied systems, and generative models. No load-bearing step reduces by construction to a self-definition, a fitted input renamed as prediction, or a self-citation chain; distinctions between theoretical and applied understanding are offered as organizational framing rather than a derived result. The paper therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on the premise that separate development trajectories of perception and symbolic reasoning require a unified bridging framework; no free parameters or invented entities are introduced as this is a review rather than a modeling paper.

axioms (1)

Domain assumption: Physics-grounded methods can enhance AI's real-world comprehension beyond pattern recognition.
Invoked in the abstract when advocating for systems that ground learning in physical principles and embodied reasoning.

pith-pipeline@v0.9.0 · 5730 in / 1267 out tokens · 24953 ms · 2026-05-18T09:53:36.279346+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, IndisputableMonolith/Cost/FunctionalEquation.lean · reality_from_one_distinction, washburn_uniqueness_aczel · unclear

Paper passage: "Our survey uniquely focuses on the evolutionary trajectory that unites these four capabilities into a coherent paradigm... hybrid approaches that integrate physics-grounded architectures, physics-informed training, and symbolic reasoning into unified frameworks."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "physics-informed neural networks (PINNs)... neuro-symbolic integration... differentiable physics engines"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
