{"total":137,"items":[{"citing_arxiv_id":"2605.23878","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:34:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23345","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T08:06:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot 
transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22809","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:57:17+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22051","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation","primary_cat":"cs.CV","submitted_at":"2026-05-21T06:38:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EasyVFX decouples VFX generation via frequency-aware Mixture-of-Experts and test-time training to achieve realistic effects with limited resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21484","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:59:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-Point Distillation constructs one-step correction targets for discrete diffusion generators via partial corruption and single teacher refinement, lifted into continuous features with a multi-bandwidth drift loss and straight-through 
estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20624","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T02:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 FPS in the Flash variant).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20388","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-19T18:38:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datasets with gains increasing at longer horizons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19957","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:10:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes World-Ego Modeling with WEM using CP-MoE 
diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19728","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:02:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19242","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyWorld: Physics-Faithful World Model for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T01:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18743","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldString: Actionable World 
Representation","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:58:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes WorldString, a differentiable neural model for the state manifold of actionable physical objects learned directly from 3D or video data as a building block for world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16713","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoWorld-VLM: Geometry from World Models for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T23:52:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15960","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Imperfect World Models are Exploitable","primary_cat":"cs.AI","submitted_at":"2026-05-15T13:54:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A formal theory proves model exploitation is essentially unavoidable on large policy sets in RL, generalizes reward hacking results, and derives a safe horizon for a relaxed version of exploitation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15831","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation","primary_cat":"cs.SD","submitted_at":"2026-05-15T10:35:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BandTok tokenizes Mel-spectrograms as 
independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15618","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Video Prediction Learns Better World Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15391","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PanoWorld: Geometry-Consistent Panoramic Video World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-14T20:24:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15178","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion 
Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14426","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A plug-and-play generative framework for multi-satellite precipitation estimation","primary_cat":"physics.ao-ph","submitted_at":"2026-05-14T06:18:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRISMA introduces a plug-and-play latent generative model that improves multi-sensor precipitation estimates by learning an unconditional prior from IMERG data and constraining it with independent sensor-specific branches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14398","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Coding Agent Is Good As World Simulator","primary_cat":"cs.AI","submitted_at":"2026-05-14T05:33:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14333","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image 
Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T03:57:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14274","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:18:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13775","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:54:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13724","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnyFlow: Any-Step Video Diffusion Model with 
On-Policy Flow Map Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-13T16:06:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13565","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-VAE-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-13T14:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16395","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:43:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OrbiSim builds a differentiable physics engine from world models to support gradient-based policy optimization and contact modeling in robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10730","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-2.0 Technical 
Report","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"shows images generated by Qwen-Image-2.0-Distillation with only 4 NFEs. Across diverse prompts, including portraits, landscapes, and natural scenes, the 4-NFE student preserves visual quality, semantic alignment, and compositional coherence comparable to the 40-step teacher, while reducing inference cost. where ξ denotes an independent Gaussian noise vector, t∈[ 0, 1] is the diffusion time sampled from a prescribed distribution p(t) (e.g., a logit-normal distribution), and xt is obtained by linearly interpolating between the conditionally generated clean samplex θ and the noise vectorξ: xt = (1−t)x θ +tξ. 
(5) Here, sfake(xt, t, c) =∇ xt logp fake,t(xt |c) denotes the conditional score function associated with the"},{"citing_arxiv_id":"2605.10408","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISOR: A Vision-Language Model-based Test Oracle for Testing Robots","primary_cat":"cs.SE","submitted_at":"2026-05-11T11:46:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"quantifies uncertainty when assessing both task correctness and task quality, thereby providing a measure of how much the VLM's assessment can be trusted. We evaluate VISOR using two VLMs, GPT and Gemini, across four robotic tasks, analyzing their ability to assess correctness and quality while accounting for uncertainty. As future work, we plan to evaluate World models like Cosmos from Nvidia [1]. These emerging models can support complex video analytics over large volumes of recorded and live video, enabling richer, more contextual understanding of visual content. Moreover, we will expand our evaluation to more VLMs and develop a voting- based mechanism to consolidate assessments from multiple VLMs. 
Acknowledgments This work is supported by the InnoGuard Marie Skłodowska-Curie"},{"citing_arxiv_id":"2605.09423","ref_index":56,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"A central bottleneck is the difficulty of simulating embodied environments at scale. Training and evaluating embodied agents require not only visually plausible 3D scenes, but also physically grounded worlds in which agents can be deployed, take actions, observe consequences, and receive task feedback. Existing embodied platforms, such as AI2-THOR [40], Habitat [56], CARLA [19], ThreeDWorld [25], and iGibson [42], provide important infrastructure for embodied AI, but they largely depend on manually designed scene collections that are expensive to construct, limited in diversity, and fixed once released. 
Procedurally generated platforms such as ProcTHOR [ 16] and Infinigen [60] improve scalability, yet their diversity is still bounded by hand-designed templates or"},{"citing_arxiv_id":"2605.08528","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SceneFactory: GPU-Accelerated Multi-Agent Driving Simulation with Physics-Based Vehicle Dynamics","primary_cat":"cs.MA","submitted_at":"2026-05-08T22:23:11+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SceneFactory delivers a batched GPU platform for physics-based multi-agent autonomous driving simulation that achieves 127x higher throughput than non-vectorized PhysX while supporting articulated dynamics and road-condition friction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07834","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenAI Powered Dynamic Causal Inference with Unstructured Data","primary_cat":"stat.ME","submitted_at":"2026-05-08T15:03:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A GenAI-based method extracts representations from unstructured data and uses a neural network to fit marginal structural models that recover causal effects of treatment feature sequences including their positions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07794","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T14:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NoiseGate learns 
per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. [24] Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025. [25] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. [26] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong."},{"citing_arxiv_id":"2605.07230","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CASCADE: Context-Aware Relaxation for Speculative Image Decoding","primary_cat":"cs.CV","submitted_at":"2026-05-08T04:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This results in enhanced drafter performance for speculative decoding where image quality is improved. • A unified framework that achieves up to3.6 × speedupwhile retaining the original image quality and prompt-fidelity of the AR model. 
2 Related Works Multi-modal AR models.AR modeling has been extended beyond language to other modalities including images [ 28, 38, 13, 22], video [ 2, 48], audio [ 11, 45], robotics [ 20, 51] and diverse modalities [29]. AR image generation [8, 10, 7, 34, 28, 42] has emerged as a powerful alternative to diffusion [17] and generative adversarial network [12] based approaches. Compared to diffusion models [ 17, 35], AR offers flexible resolution control and seamless multi-modality integration. To"},{"citing_arxiv_id":"2605.07061","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Joint Audio-Video Generation Models Understand Physics?","primary_cat":"cs.SD","submitted_at":"2026-05-08T00:14:07+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"produce physically inconsistent generations, including implausible visual artifacts and audio that contradicts the depicted scene dynamics. These failures suggest that generating perceptually realistic audio and video is fundamentally different from modeling the causal physical relationships that jointly govern both modalities. This distinction is particularly important for downstream applications such as world simulation [ 1, 5, 4], embodied agents [ 7, 11, 24], and educational content, where maintaining physically consistent audio-visual behavior is essential. 
Evaluating joint audio-video generation, therefore, requires a benchmark that goes beyond perceptual quality and semantic alignment to test whether audio, video, and their interaction remain physically consistent."},{"citing_arxiv_id":"2605.06628","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation","primary_cat":"eess.IV","submitted_at":"2026-05-07T17:42:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constrained sensors across modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06388","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models","primary_cat":"cs.CV","submitted_at":"2026-05-07T15:05:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Freedom (DoF) end-effector actions covering position, rotation, and gripper state, and a language instruction. For trajectory success classiﬁcation, we use SOAR [ 81] which contains roughly 30.5K success/failure class episodes for WidowX 250 with a 1:2 class split. Encoder variants. We compare two encoder families. 
reconstruction-aligned encoders f PIX ϕ include: Stable Diffusion 3 (SD3) V AE [16] with D=16, V A-V AE [71] with D=32, and Cosmos [ 1] with D=16; for these, αψ ≡ I. Semantics-aligned encoders f REP ϕ include: V -JEPA 2.1 [38] with D=1024, Web-DINO [18], adapted from DINOv2 [ 41], with D=1024, and SigLIP 2 [ 61] with D=1152. For semantic encoders, we evaluate both native latents and compact latents from a pretrained S-V AE adapter [ 78], which maps D→d with d=96. Adapter, decoder, and transition model."},{"citing_arxiv_id":"2605.06337","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Earth-o1: A Grid-free Observation-native Atmospheric World Model","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:27:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Earth-o1 learns continuous atmospheric dynamics from ungridded observations and matches operational IFS forecast skill in hindcasts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06192","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields","primary_cat":"cs.CV","submitted_at":"2026-05-07T13:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"focuses on preserving robot spatial motion and robot-object interaction dynamics in generated videos. 
It maps actions and kinematic states intoStructured Kinematic-to-Visual Action Fields(KV AFs) aligned with the visual generation domain, and integrates them through EDLS-guided event-aware bidirectional fusion. On WorldArena, EA-WM achieves the bestP3CScore among compared models. 9 References [1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. [2] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli,"},{"citing_arxiv_id":"2605.05187","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:52:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05148","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Matters in Practical Learned Image Compression","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:17:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A practical learned image codec delivers 2.3-3x bitrate savings over AV1/VVC and 20-40% over prior learned codecs while 
encoding 12MP images in 230ms on iPhone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01477","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion","primary_cat":"cs.RO","submitted_at":"2026-05-02T14:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00078","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. [17] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. [18] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. 
Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. [19] Lars Berscheid, Pascal Meißner, and Torsten Kröger. Robot learning of shifting objects for grasping in cluttered"},{"citing_arxiv_id":"2604.26694","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising","primary_cat":"cs.RO","submitted_at":"2026-04-29T14:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834-9844, 2025. [15] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024. [16] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025. 
[17] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al."},{"citing_arxiv_id":"2604.26182","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lifting Embodied World Models for Planning and Control","primary_cat":"cs.CV","submitted_at":"2026-04-28T23:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-efficient and generalizing to unseen environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24762","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer","primary_cat":"cs.CV","submitted_at":"2026-04-27T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23629","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation","primary_cat":"cs.GR","submitted_at":"2026-04-26T09:44:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys 3D asset generation methods and organizes 
them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"missing link is a generation pathway that produces natively simulation-compatible content. 8.3 Toward World Models: Structured 3D as Foundation for Intelligent Simulation Perhaps the most far-reaching open question is the role of production-ready 3D generation in constructing world models. As discussed in Section 1, video-based [10, 11, 275, 276] and structured [12, 14, 15, 277] paradigms are converging toward architectures that combine structured 3D assets for physical grounding with neural rendering for visual diversity. In this architecture, the generation methods surveyed here serve as the content supply chain: geometry synthesis, texturing, rigging, and scene composition directly determine how rich and physically faithful the resulting world model can be."},{"citing_arxiv_id":"2604.22748","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond","primary_cat":"cs.AI","submitted_at":"2026-04-24T17:48:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"cess Rate (ASR) measures how often a planner that uses the world model's rollouts to select actions achieves the task goal in the real environment: ASR = (1/N) ∑_{i=1}^{N} 1[task_i succeeds under policy derived from p̂]. 
The Counterfactual Outcome Deviation (COD) measures intervention sensitivity by comparing rollout outcomes under two policies a^(1)_{1:H} and a^(2)_{1:H} that differ at a single intervention step k: COD(k) = E[d(ẑ^(1)_H, ẑ^(2)_H)], where d is a task-relevant distance (e.g., goal-state distance in physical tasks, edit distance in software tasks). When COD is low, a world model is largely unresponsive to changes in action, which makes it uninformative for counterfactual planning. Together, ASR and COD provide a more direct link between world-model quality and downstream agentic performance: ASR assesses whether the model supports good decisions,"},{"citing_arxiv_id":"2604.22227","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies","primary_cat":"cs.CY","submitted_at":"2026-04-24T05:02:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mathematical structure explicit. Lemma 3 (Bounded reciprocal compatibility). For every i∈H and j∈A, 0 ≤ m_ij ≤ 1. (19) Hence 0 ≤ w_ij ≤ a_ij ≤ 1. (20) Proof. Because all demand and supply vectors are nonnegative, each inner product in (2) is nonnegative, so m_ij ≥ 0. By Cauchy-Schwarz, ⟨d_i^H, u_j^A⟩ ≤ ‖d_i^H‖ ‖u_j^A‖, (21) and therefore ⟨d_i^H, u_j^A⟩ / (‖d_i^H‖ ‖u_j^A‖ + ε) < 1. (22) The same bound holds for the second term in (2). Their average therefore lies in [0, 1). Since w_ij = a_ij m_ij with a_ij ∈ [0,1], it follows that 0 ≤ w_ij ≤ a_ij ≤ 1. Proposition 4 (Global well-posedness). Assume A is finite and ν_i ≥ 0 for all i. 
Then for every initial condition x(0) = x_0 ∈ R^d, the system (16) admits a unique global solution for all t ≥ 0. Proof. Define f(x) = b − Ax − ν ⊙ x^{⊙3}."},{"citing_arxiv_id":"2604.21914","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-23T17:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sensitive to viewpoint changes, which we address by building viewpoint-robust closed-loop control. B. Generative Models for Robot Planning With the recent advancements in generation models in terms of image quality, temporal consistency, and scene generalization, many works have begun to explore their potential in autonomous driving [20], [21], [22] and robotics [23]. In the realm of robot planning, some works directly employ generative models to produce action videos and then generate the corresponding actions based on the video outputs. 
These videos can serve as synthesized sub-goals to provide visual guidance for subsequent policy generation [24], [25], or be utilized to extract robot actions by leveraging inverse dynam-"},{"citing_arxiv_id":"2604.21456","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics","primary_cat":"cs.LG","submitted_at":"2026-04-23T09:13:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"base distribution to a target to reduce variance and estimate normalizing constants [71], and sequential Monte Carlo samplers strengthen this idea via resampling and MCMC move steps in the canonical reweight-resample-move template [26]. REFERENCES [1] [MJX] jax.lax.while_loop in solver.py prevents computation of backward gradients (#2259). https://github.com/google-deepmind/mujoco/issues/2259. 2024. [2] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. \"Cosmos world foundation model platform for physical ai\". In: arXiv preprint arXiv:2501.03575 (2025). [3] Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Tobias Pfaff, Cheston Tan, Henrik I Christensen, Hao Su, Jiajun Wu,"}],"limit":50,"offset":0}