{"total":25,"items":[{"citing_arxiv_id":"2605.22882","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-20T21:36:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Action-Conditioned iVideoGPT [23], FlowDreamer [24], EnerVerse [25], PlaNet [26], TransDreamer [27], V-JEP A [28]. . . Langugae-Conditoned MoCoGAN [29], U-Net [30], Latte [ 31], Wan [32], Sora 2 [ 33]. . . Embodied World Model SWIM [34], DreamDojo [ 35], RoboDreamer [36], RoboScape [37]. . . 
WM for VLA Imitation Learning Ctrl-World [38], RoboScape [37], DREMA [ 39] Reinforcement Learning Dreamer to Control [ 40] DreamerV2 [ 41], Dreamer 4 [ 42], RISE [ 43] DreamerV3 [44], DayDreamer [45], World-Env [46], RoboScape-R [47] WMPO [48], WoVR [49], VLA-RFT [50], RWML [51], MoDem-V2 [52] World-Gymnast [53], RWM-U [54], World4RL [55], VIPER [ 56] PhysWorld [57], Diffusion Reward [58], GenReward [59] Evaluation Ctrl-World [38], Veo Robotics [60], Interactive World Simulator [61]"},{"citing_arxiv_id":"2605.10942","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[57] Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659, 2025. [58] Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, and Jianlan Luo. Act2goal: From world model to general goal-conditioned policy.arXiv preprint arXiv:2512.23541, 2025. [59] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Ro- bodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. [60] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. 
Unified world models: Coupling video and action diffusion for pretraining on large"},{"citing_arxiv_id":"2605.07794","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T14:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. [26] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation.arXiv preprint arXiv:2312.13139, 2023. [27] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Ro- bodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. [28] Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. 
Video prediction policy: A generalist robot"},{"citing_arxiv_id":"2605.06481","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"manipulation with low-cost hardware. InRobotics: Science and Systems (RSS), 2023. arXiv:2304.13705. [97] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-VLA: Soft-prompted transformer as scalable cross-embodiment Vision-Language-Action model.arXiv preprint arXiv:2510.10274, 2025. [98] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. InInternational Conference on Machine Learning (ICML), 2024. arXiv:2404.12377. [99] Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. 
LIBERO-PRO: Towards robust and fair evaluation of Vision-Language-Action models"},{"citing_arxiv_id":"2604.28185","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24661","ref_index":81,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations","primary_cat":"cs.RO","submitted_at":"2026-04-27T16:24:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ACO-MoE recovers 95.3% of clean-input performance in visual control tasks under Markov-switching corruptions by routing restoration experts and anchoring representations to clean foreground masks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"World models such as DreamerV3 [13] are trained with reconstruction objectives that incentivize the latent representation to encode corruption-specific features. Under dynamically switching perturbations, the world model must simultaneously represent multiple corruption patterns, contaminating the latent state and severely degrading the imagined rollouts used for policy optimization [81]. 
Even non-reconstructive planners such as TD-MPC2 [18] still rely on clean input, so model-based failures compound across every predicted future state. To mitigate these issues, existing solutions fall into two broad categories. The first relies on data augmentation"},{"citing_arxiv_id":"2604.21241","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"can translate into broader task coverage in robotics. At the same time, the field has been actively experimenting with different design choices-from diffusion/flow-based action heads that improve continuous control fidelity (e.g., Octo [3], pi0 [4], RDT [5]), to richer multimodal structures and training signals (e.g., GR-1/GR-2 [6], [7], RoboDreamer [8], and RL-augmented variants [9], [10]). These parallel threads reflect an ongoing evolution of VLA paradigms rather than a settled blueprint [11]. Alongside architectural progress, the robotics community continues to accumulate data from increasingly diverse platforms and setups. 
Differences in embodiments, controllers, camera configurations, and annotation conventions make it"},{"citing_arxiv_id":"2604.19092","ref_index":62,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-21T05:09:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, while these models can generate visually plausible manipulation videos, it remains unclear whether they preserve physical consistency and interaction dynamics in a manner that supports executable control. Motivated by this challenge, a growing body of work develops robotics-oriented world models that explicitly target embodied control [3,57,62]. DreamGen [23] finetunes video world models to better learn the robot's physical constraints and movement capabilities. Large Video Planner (LVP) [9] investigates video-conditioned planning, leveraging predicted visual rollouts as intermediate representations for downstream control. 
WoW [12] emphasizes physically grounded intuition through large-scale embodied interaction"},{"citing_arxiv_id":"2604.17887","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[53] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. 2 [54] Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024. 2 [55] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377, 2024. 
1, 3, 4, 7 17"},{"citing_arxiv_id":"2604.16592","ref_index":231,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"✗ ✗ ✗ ✗ ✓ ✓ ✗World Model exploration via en- semble disagreement for zero- shot task adaptation DayDreamer [189] 2022 Robot.✓ ✗ ✗ ✗ ✓ ✓ ✗Dreamer-style World Model learning directly on physical robots from sparse rewards Dream to Manipulate [12] 2024 Robot.✗ ✓ ✗ ✗ ✓ ✗ ✗Compositional 3DGS with object decomposition for imagination- based imitation learning RoboDreamer [231] 2024 Robot.✗ ✗ ✓ ✗ ✓ ✗ ✗Compositional diffusion World Model factorizing language in- structions into task primitives GenRL [119] 2024 Robot.✗ ✗ ✓ ✗ ✓ ✓ ✗Foundation World Models for generalization in embodied RL via multimodal priors DreamGen [77] 2025 Robot.✗ ✓ ✗ ✗ ✓ ✗ ✗Fine-tunes Cosmos Predict-2.5 as synthetic robot data engine for policy training"},{"citing_arxiv_id":"2604.15938","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-17T10:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical 
Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Diffusion Policy [3] was the first to demonstrate that the iterative denoising mechanism of diffusion models outperforms traditional Gaussian policies in high-dimensional continuous control tasks, enabling smoother, more stable, and more diverse action distributions. Subsequent studies have extended this framework to various domains of robotic manipulation, including trajectory generation [9] [34], grasp planning [12] [20], 4D spatiotemporal awareness [17] and visual data augmentation for vision-based manipulation [32], providing new pathways for complex task decomposition, generalizable control, and multimodal perception. Diffusion Models. Diffusion models are generative models that learn data distributions through a two-stage noising-denoising Markov process."},{"citing_arxiv_id":"2604.11751","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11386","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ComSim: Building Scalable Real-World Robot Data Generation via Compositional 
Simulation","primary_cat":"cs.RO","submitted_at":"2026-04-13T12:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"intensive, limiting broad access. Generative video models [1,50] offer a cost-effective way to synthesize policy training data. UniPi [14] and AVDC [21] cast robot planning as text-to-video generation (AVDC further estimates inverse dynamics with a pretrained flow network); UniSim [53] learns a unified real-world simulator across text and control inputs; RoboDreamer [61] targets compositional generalization via text parsing; and IRASim [62] performs trajectory-conditioned video generation but focuses on arm motion only. In this work, our world simulator turns action-consistent simulation trajectories into high-fidelity, real-style data. 
3 Compositional World Simulation 3.1 Problem Formulation In the context of robotic manipulation, collecting real-world data is often a"},{"citing_arxiv_id":"2604.08168","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-04-09T12:28:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"rob: Advancing video generation of task-oriented hand-object interaction for generalizable robotic manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27683- 27693, 2025. 4 [58] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404, 2024. 3 [59] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. 3 [60] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. 
Internvl3: Exploring advanced training and test-time recipes for open-source"},{"citing_arxiv_id":"2604.06168","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-07T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learning 4d embodied world models. arXiv preprint arXiv:2504.20995 (2025) [71] Zheng, D., Huang, S., Zhao, L., Zhong, Y., Wang, L.: Towards learning a generalist model for embodied navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13624-13634 (2024) [72] Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: Robodreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024) [73] Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. 
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision."},{"citing_arxiv_id":"2603.16666","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","primary_cat":"cs.CV","submitted_at":"2026-03-17T15:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. [22] Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doer- sch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024. [23] John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model, 2025. URL https: //arxiv.org/abs/2510.27607. [24] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. 
Gr-2: A generative"},{"citing_arxiv_id":"2602.20309","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models","primary_cat":"cs.LG","submitted_at":"2026-02-23T19:55:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15922","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models are Zero-shot Policies","primary_cat":"cs.RO","submitted_at":"2026-02-17T15:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[90] Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, and Junwei Liang. Exploring the limits of vision-language-action manipulations in cross-task generalization.arXiv preprint arXiv:2505.15660, 2025. 2 [91] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. 
5 [92] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. 5, 7 36"},{"citing_arxiv_id":"2602.11075","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RISE: Self-Improving Robot Policy with Compositional World Model","primary_cat":"cs.RO","submitted_at":"2026-02-11T17:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21998","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Causal World Modeling for Robot Control","primary_cat":"cs.CV","submitted_at":"2026-01-29T17:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024. [95] Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, and Jianlan Luo. Act2goal: From world model to general goal-conditioned policy.arXiv preprint arXiv:2512.23541, 2025. 
[96] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination.arXiv preprint arXiv:2404.12377, 2024. [97] Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets."},{"citing_arxiv_id":"2512.05564","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProPhy: Progressive Physical Alignment for Dynamic World Simulation","primary_cat":"cs.CV","submitted_at":"2025-12-05T09:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01773","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","primary_cat":"cs.RO","submitted_at":"2025-12-01T15:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.17697","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Syntax: Action Semantics Learning for App 
Agents","primary_cat":"cs.AI","submitted_at":"2025-06-21T12:08:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.12705","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World Models","primary_cat":"cs.RO","submitted_at":"2025-05-19T04:55:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}