{"total":17,"items":[{"citing_arxiv_id":"2606.00793","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MBench: A Comprehensive Benchmark on Memory Capability for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-30T16:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00499","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OptiWorld: Optimal Control for Video World Generation under Physical Constraints","primary_cat":"cs.CV","submitted_at":"2026-05-30T03:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31590","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TunerDiT adds event-partitioned masking and cross-event prompt fusion to diffusion transformers for training-free multi-event video generation, with gains scaling by event count on a new Meve 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17912","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform","primary_cat":"cs.RO","submitted_at":"2026-05-18T06:18:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15185","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quantitative Video World Model Evaluation for Geometric-Consistency","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":217,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing 
approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Physics-IQ [202], WorldScore [203], EWMBench [204] Action Plausibility WorldSimBench [205], Wow, wo, val! [206] Action Policy General MetaWorld [207], RLBench [208], Robomimic [209], Franka Kitchen [210], ManiSkill [211] ManiSkill2 [151], ManiSkill3 [212], RoboCasa [152], CALVIN [213], VIMAbench [214] VLMbench [215], LIBERO [216], Libero-plus [4], Libero-pro [217], Libero-X [218] COLOSSEUM [219], AGNOSTOS [220], RoboEval [221], RoboVerse [222], PolaRiS [223] RoboMME [224], GenManip [225], VLABench [226], RoboSuite [227], RoboLab [228] SimplerEnv [229], ARNOLD [230], GemBench [231] Bimanual and Humanoid Form RoboTwin [153], BiGym [232], HumanoidBench [233] HumanoidGen [234] Mobile Manipulation ManipulaTHOR [235], HomeRobot [236], BEHAVIOR-1K [237]"},{"citing_arxiv_id":"2605.01799","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embody4D: A Generalist Data Engine for Embodied 4D World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-03T09:39:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multiview information is crucial for embodied manipulation and planning, and there is an urgent need for embodied multiview 4D world models to provide comprehensive spatial environmental representations for downstream tasks. reasoning tasks [27]. However, while the physical world is inherently three-dimensional, most existing world models remain confined to 2D pixel space [13]. This limitation results in impoverished spatial representations, depriving models of the embodied spatial reasoning and comprehensive multi-view information essential for downstream manipulation [20]. 
To empower world models with spatial intelligence [47], it is imperative to achieve dimensional lifting, as shown in Fig. 1: inferring dynamic, multi-view"},{"citing_arxiv_id":"2604.20157","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HumanScore: Benchmarking Human Motions in Generated Videos","primary_cat":"cs.CV","submitted_at":"2026-04-22T03:51:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Crafter [37], which aggregates large prompt sets and objective metrics with human correlation studies for better video quality evaluations; T2V-CompBench [61], which evaluates generated content by targeting compositional generalization across attributes, spatial relations, and action binding; Video-Bench [21], which presents a toolkit to better cover action consistency and motion/temporal quality; WorldScore [14], emphasizes holistic \"world generation\" quality; HumanBench and MotionBench [22,63], which focus on human-centric perception or motion understanding. While several of these benchmarks include metrics for human motions, none comprehensively evaluates the biomechanical plausibility of human figures and motions in generated videos. 
Our benchmark fills this gap"},{"citing_arxiv_id":"2604.18564","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"dom actions, which produced meaningless episodes and negatively affected world model training. B Metrics In this section, we present the detailed implementation of the Reprojection Error (RPE) and the Action Following Ability. B.1 Reprojection Error To evaluate multi-view geometric consistency, we employ Reprojection Error (RPE), a standard metric in visual SLAM. Following the methodology of [8,52], we utilize DROID-SLAM [42] for scene reconstruction. This process involves extracting frame-to-frame features, then refining camera poses (Gt) and pixel-wise depth maps (dt) via differentiable Dense Bundle Adjustment (DBA). By enforcing optical flow constraints, this approach ensures robust structure-from-motion. 
The RPE is computed as the average Euclidean distance between the"},{"citing_arxiv_id":"2604.14268","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Our method outperforms previous approaches in extension plausibility, content richness, and overall quality. Results of Camera Control Capability. We quantitatively evaluate the camera control capability of WorldStereo 2.0 in Tab. 6, while ablation studies are performed in Tab. 7. Both evaluations are applied with 100 out-of-distribution images selected from [15] with challenging trajectories. Notably, WorldStereo 2.0 outperforms all video-based competitors by achieving the lowest errors across all camera metrics. Furthermore, it also delivers superior visual quality and semantic alignment. For the ablation study in Tab. 
7, since Keyframe-VAE introduces significant changes to the latent representations, directly applying it without training the main network is unfair and yields limited"},{"citing_arxiv_id":"2512.14614","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling","primary_cat":"cs.CV","submitted_at":"2025-12-16T17:22:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01843","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T16:28:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18373","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2025-11-23T09:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement 
fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.17792","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?","primary_cat":"cs.CV","submitted_at":"2025-11-21T21:36:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Target-Bench shows the best off-the-shelf video world model scores only 0.341 on semantic target-approaching and directional consistency, with fine-tuning on a small robot dataset yielding measurable gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00062","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Simulation with Video Foundation Models for Physical AI","primary_cat":"cs.CV","submitted_at":"2025-10-28T22:44:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[16] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin CM Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023. 22 [17] Databricks. Delta lake: Open-source storage framework that enables building lakehouses. https://delta.io/, 2019. Open-source project, Delta Lake. 6 [18] Google DeepMind. Veo 3, 5 2025. URL https://deepmind.google/technologies/veo/veo-3/. 35 [19] Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. 
Worldscore: A unified evaluation benchmark for world generation. arXiv preprint arXiv:2504.00983, 2025. 36 [20] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution"},{"citing_arxiv_id":"2507.07982","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling","primary_cat":"cs.CV","submitted_at":"2025-07-10T17:55:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.12705","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World Models","primary_cat":"cs.RO","submitted_at":"2025-05-19T04:55:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}