MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
hub Mixed citations
Worldscore: A unified evaluation benchmark for world generation
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
TunerDiT adds event-partitioned masking and cross-event prompt fusion to diffusion transformers for training-free multi-event video generation, with gains scaling by event count on a new Meve benchmark.
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
citing papers explorer
No citing papers match the current filters.