The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.
hub
Embodied AI Agents: Modeling the World
26 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
polarities: background 3
representative citing papers
Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.
COMAP co-evolves textual world models and agent policies for LLMs through on-policy self-distillation, yielding up to 16.75% relative gains on embodied planning, web navigation, and tool-use tasks.
A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
SCRIPT presents a scalable diffusion policy with JAST-DiT architecture, nonlinear history conditioning, and RLHR post-training that claims to outperform prior methods on text alignment, motion quality, and physical realism while scaling on a 1200-hour dataset.
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.
Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
AgentComm achieves nearly 50% bandwidth reduction in embodied agent communication via LLM semantic processing, importance-aware transmission, and a task knowledge base, with negligible impact on task completion.
A morphology-conditioned quadrupedal world model enables zero-shot generalization to new robot embodiments for locomotion tasks.
GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
SpeechLess enables micro-utterance AR interactions by binding prior interactions to personal spatial context for intent extrapolation.
A three-stage training pipeline internalizes world-model simulation and success estimation in LLM agents for improved planning on search and math tasks.
MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and interactive agent interfaces.
IndustryAssetEQA integrates episodic telemetry representations with an FMEA knowledge graph to support embodied question answering over industrial assets, showing large gains in validity and reduced overclaims versus LLM baselines.
An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.
The paper presents a proof-of-concept closed-loop system that uses consumer EEG to detect high cognitive engagement and defers multi-agent robotic communications until workload is lower.
Proposes a four-layer hierarchical communication architecture for 6G-enabled human-robot interaction and shows feasibility via a 5G-based prototype with millisecond latency and stable operation.
An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.
Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
A tutorial that unifies explicit and implicit world models through shared predictive structure for applications in physical AI such as robotics.
A systematic review of resource consumption threats in LLMs that organizes the problem along the full pipeline from threat induction to mitigation.
citing papers explorer
- Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
  A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.
- GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
  GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
- VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction
  VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
- Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints