{"total":15,"items":[{"citing_arxiv_id":"2606.17730","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActWorld: From Explorable to Interactive World Model via Action-Aware Memory","primary_cat":"cs.CV","submitted_at":"2026-06-16T09:47:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, backed by a 100K annotated interaction dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09803","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Echo-Memory: A Controlled Study of Memory in Action World Models","primary_cat":"cs.CV","submitted_at":"2026-06-08T17:54:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09507","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prisma-World: Camera-Controllable Multi-Agent Video World Model","primary_cat":"cs.CV","submitted_at":"2026-06-08T13:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent 
cross-view videos from flexible agent counts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02753","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data","primary_cat":"cs.CV","submitted_at":"2026-06-01T18:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01164","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends","primary_cat":"cs.CV","submitted_at":"2026-05-31T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00499","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OptiWorld: Optimal Control for Video World Generation under Physical Constraints","primary_cat":"cs.CV","submitted_at":"2026-05-30T03:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under 
physical constraints, and renders the video conditioned on that trajectory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28816","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players","primary_cat":"cs.CV","submitted_at":"2026-05-27T17:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent video world model using simplex rotary agent encoding and sparse hub attention achieves better fidelity, controllability, and consistency than baselines while generalizing from 2 to 4 players.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22718","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldKV: Efficient World Memory with World Retrieval and Compression","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:55:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WorldKV enables persistent world memory in autoregressive video diffusion models by selectively retrieving and compressing KV-cache chunks, matching full-cache fidelity at roughly twice the throughput without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18601","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:12:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Incantation is the first video world model to use per-frame natural 
language conditioning for simultaneous multi-entity control and concept-level cross-entity transfer in interactive video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18431","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T14:04:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot cooperative spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23993","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nano World Models: A Minimalist Implementation of Future Video Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-17T22:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Nano World Models supplies a unified minimalist codebase and evaluation framework for studying diffusion forcing in video prediction across control, games, and robot domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09965","ref_index":142,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Generalist Game Players: An Investigation of Foundation Models in the Game 
Multiverse","primary_cat":"cs.CV","submitted_at":"2026-05-11T04:16:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"crafted multiverse of games; and ultimately, (4) The Creator, the future where AI transcends playing to become the simulator itself, autonomously generating and evolving infinite game multiverses. To establish a rigorous foundation, we formalize the interaction between an AI agent and a game environment as a Partially Observable Markov Decision Processes (POMDPs) [142]. A game can be described as a tuple M=⟨G, S, A, T, R,Ω, O, γ⟩, where: •Gis a set of potential goals or objectives. Eachg∈Gis a task objective (e.g., natural language prompts, target images, or specific sub-tasks) that dictate the agent's current mission. •Sis a set of states. Eachs∈Srepresents the internal state of the environment. 
•Ais a set of actions."},{"citing_arxiv_id":"2605.08567","ref_index":25,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-09T00:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, most existing works focus on egocentric settings, where actions primarily correspond to navigation, such as Genie-3 [ 23], RELIC [ 8], and WorldPlay [ 27]. These settings involve limited direct interaction with the environment. Other works instead concentrate on narrow domains, such as robot manipulation, including Vid2World [10], BridgeV2W [ 2], WoVR [ 12], and Ctrl-World [ 3], or on Minecraft gameplay [ 25, 5]. A key limitation of these approaches is their limited investigation of complex physical interactions, as most mainly focus on simple navigation, or rigid-body dynamics such as picking, pushing, and grasping. 
Physics in Video Diﬀusion Models Recent work has begun to investigate how well video diﬀusion models capture physical principles and whether they can serve as implicit world models [ 13, 33, 22, 37], and"},{"citing_arxiv_id":"2604.22847","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes","primary_cat":"cs.CV","submitted_at":"2026-04-22T00:46:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18564","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"must synthesize visually coherent videos across diverse perspectives, while en- suring observations from multiple agents remain geometrically consistent. (3) Framework Scalability: real-world environments involve variable numbers of agents and views, requiring a framework that generalizes across configurations without assuming fixed agent counts or camera setups. 
Previous works [34,67] assume a fixed number of agents or predefined camera views, which limits their applicability in diverse real-world scenarios. In this work, we introduce MultiWorld to address the challenges of multi-agent, multi-view world modeling, enabling flexible scaling of agent and view counts, as illustrated in Fig. 1. (1) To achieve Multi-Agent Controllability, we"}],"limit":50,"offset":0}