YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
hub Mixed citations
Is sora a world simulator? A comprehensive survey on general world models and beyond
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
PhysMani couples a physics-principled 3D Gaussian world model with a future-aware policy to achieve higher success rates on dynamic manipulation tasks in simulation and real robots.
AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot control, improving localization, affordance, and robustness.
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
The paper describes an open-source web application that applies generative world models to produce interactive cinematic explorations of quantum computing hardware grounded in AWS Braket specifications.
A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.
citing papers explorer
-
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
-
Learning to Theorize the World from Observation
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
-
Echo-Memory: A Controlled Study of Memory in Action World Models
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
-
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
-
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
-
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
-
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.
-
A Survey on Vision-Language-Action Models for Embodied AI
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
-
PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation
PhysMani couples a physics-principled 3D Gaussian world model with a future-aware policy to achieve higher success rates on dynamic manipulation tasks in simulation and real robots.
-
Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot control, improving localization, affordance, and robustness.
-
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
-
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
-
Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models
The paper describes an open-source web application that applies generative world models to produce interactive cinematic explorations of quantum computing hardware grounded in AWS Braket specifications.
-
A Tutorial on World Models and Physical AI
A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models