{"total":143,"items":[{"citing_arxiv_id":"2606.26922","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling","primary_cat":"cs.RO","submitted_at":"2026-06-25T11:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A cost-aware selective inference framework combines a lightweight multimodal student model and driver-state world modeling to reduce unsafe false negatives in driver monitoring while keeping low latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22449","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence","primary_cat":"cs.AI","submitted_at":"2026-06-21T11:46:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a self-evolving cognitive framework integrating causal world modeling, intervention-driven reasoning, and continual refinement for embodied scientific intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23699","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T14:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and 
appearance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23345","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T08:06:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22809","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:57:17+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22138","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Agentic Reasoning Through Self-Regulated Simulative Planning","primary_cat":"cs.AI","submitted_at":"2026-05-21T08:11:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and 
RL.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, and Ethan Perez. Inverse scaling in test-time compute. Transactions on Machine Learning Research, 2025. [29] Google. Try deep research and our new experimental model in gemini, your ai assistant. https://blog.google/products-and-platforms/products/gemini/google-gemini-deep-research/, December 2024. Accessed: 2026-04-04. [30] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018. [31] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019. [32] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson."},{"citing_arxiv_id":"2605.22089","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T07:31:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21963","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data","primary_cat":"cs.LG","submitted_at":"2026-05-21T03:50:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CMWM is a recurrent latent world model for 
forecasting patient trajectories like annual eGFR in CKD, reporting 7.28% lower MAE than a tuned GPT-5.5 baseline on a 2232-patient cohort with gains from dialogue data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21800","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation","primary_cat":"cs.LG","submitted_at":"2026-05-20T22:58:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20910","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:55:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early and deterministic sampling later.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20833","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemGym: a Long-Horizon Memory Environment for LLM 
Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:25:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20811","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation","primary_cat":"cs.RO","submitted_at":"2026-05-20T07:05:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20448","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?","primary_cat":"cs.CV","submitted_at":"2026-05-19T20:01:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18137","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous 
Driving","primary_cat":"cs.CV","submitted_at":"2026-05-18T09:46:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17580","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation","primary_cat":"cs.AI","submitted_at":"2026-05-17T18:14:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ECG-WM combines ODE physiological priors with latent diffusion models to generate intervention-conditioned ECG trajectories and uses diffusion stochasticity for uncertainty-aware clinical risk assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16899","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","primary_cat":"cs.CV","submitted_at":"2026-05-16T09:21:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16725","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World 
Models","primary_cat":"cs.AI","submitted_at":"2026-05-16T00:18:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16692","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control","primary_cat":"cs.LG","submitted_at":"2026-05-15T23:08:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16530","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation","primary_cat":"cs.CV","submitted_at":"2026-05-15T18:27:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWoMo decouples symbolic rule-based motion modeling via scene graphs from visual realism via diffusion models, trained through inverse pairing of real cataract surgery videos reconstructed in the simulator for sim-to-real translation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16030","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind Dreamer: Untethering 
Imagination via Active Causal Intervention on Latent Manifolds","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:05:58+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15618","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Video Prediction Learns Better World Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T04:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15524","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Point-Forms","primary_cat":"cs.LG","submitted_at":"2026-05-15T01:44:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Neural point-forms are introduced as permutation-invariant neural layers that output learned form-comparison matrices for point clouds, with a claimed consistency proof under sampling and manifold assumptions and competitive results on synthetic and biological data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15477","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoExo-WM: Unlocking Exo Video for Ego World 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-14T23:35:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15178","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15256","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReactiveGWM: Steering NPC in Reactive Game World Models","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13740","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning POMDP World Models from Observations with Language-Model 
Priors","primary_cat":"cs.LG","submitted_at":"2026-05-13T16:18:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"other parts of the pipeline, such as the observation distance, the planner, and the demonstration buffer, which are currently fixed. Third, the reliance on LLM API calls induces high variance inPinductor and related methods [14], which future work should aim to reduce. References [1] Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, second edition, 2018. 1, 3 [2] David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. [3] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025. 1 [4] K. J Åström. 
Optimal control of Markov processes with incomplete state information.Journal of Mathematical Analysis and Applications, 10(1):174-205, February 1965."},{"citing_arxiv_id":"2605.13013","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-13T05:07:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12733","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Generalist to Specialist Representation","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:34:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Task structure is identifiable across time steps and task-relevant representations are identifiable within steps in a nonparametric setting under sparsity regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12651","ref_index":84,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic","primary_cat":"cs.LG","submitted_at":"2026-05-12T18:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using 
distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16398","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:13:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12289","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PriorZero: Bridging Language Priors and World Models for Decision Making","primary_cat":"cs.LG","submitted_at":"2026-05-12T15:47:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Two factors keep this approximation tight in our setting. First, the LLM prior is injected at the MCTS root (line 7 of Table 2), so πϕ and πMCTS are coupled by construction: the search distribution is a planning-improved version of the LLM prior on the same root state, not an independently sampled behaviour policy. 
Second, the clipped PPO ratio ρt,j ∈[1−ϵ low,1 +ϵ high] together with the KL regularizer βDKL(πϕ∥πϕref ) bounds the per-update policy drift, which limits how far πϕ can move away from the behaviour distribution that produced the stored advantages, making this a standard near-on-policy regime in PPO-style fine-tuning. Such crucial under Jericho-style sparse rewards, where any additional variance in the policy-gradient signal would be amplified by long"},{"citing_arxiv_id":"2605.11743","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:21:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WorldComp2D explicitly structures latent space geometry by object identity and spatial proximity via a proximity-dependent encoder and localizer, cutting parameters up to 4X and FLOPs 2.2X versus state-of-the-art lightweight models on facial landmark localization while staying real-time on CPU.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18803","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PROWL: Prioritized Regret-Driven Optimization for World Model Learning","primary_cat":"cs.LG","submitted_at":"2026-05-11T14:24:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PROWL introduces a KL-constrained adversarial curriculum and prioritized adversarial trajectory buffer to actively discover and correct rare failure modes in action-conditioned video world 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09900","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark","primary_cat":"cs.AI","submitted_at":"2026-05-11T02:44:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09886","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Network-Efficient World Model Token Streaming","primary_cat":"cs.RO","submitted_at":"2026-05-11T02:19:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bitrates for tokenized driving world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09874","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-11T01:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence 
across days in egocentric video.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09146","ref_index":92,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Thinking: Imagining in 360$^\\circ$ for Humanoid Visual Search","primary_cat":"cs.CV","submitted_at":"2026-05-09T20:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08954","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MolWorld: Molecule World Models for Actionable Molecular Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-09T13:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolWorld expands a molecule-transfer graph using a world model to discover high-property molecules that maintain strong structural connectivity to known compounds for actionable optimization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"for sequence generation models.arXiv preprint arXiv:1705.10843, 2017. [15] Jeff Guo and Philippe Schwaller. Augmented memory: Capitalizing on experience replay to accelerate de novo molecular design.arXiv preprint arXiv:2305.16160, 2023. [16] Wes Gurnee and Max Tegmark. Language models represent space and time. InInternational Conference on Learning Representations, 2024. [17] David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 
[18] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Learning latent dynamics for planning from pixels. InInternational Conference on Machine Learning, 2019. 11 [19] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control:"},{"citing_arxiv_id":"2605.08732","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Geometry Beyond Search: Amortizing Planning in World Models","primary_cat":"cs.RO","submitted_at":"2026-05-09T06:36:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08578","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:43:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score of 0.770.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08567","ref_index":4,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-09T00:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across 
rigid, kinematic, deformable, and particle dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"4× temporal compression outperforms a frame-independent encoder; and (iii) increasing the action-space dimensionality poses a greater learning challenge for the model, but it can also provide richer observational cues and thereby improve generalization for certain scenes. 2 Related Works Action-conditioned World Models The idea of learning a model of the environment [4] for planning and decision-making has a long history in reinforcement learning. Recently, driven by rapid advances in diffusion-based image and video generation [6, 14, 29, 30, 11, 34], pixel-space world models have regained significant attention for generating high-quality visual predictions conditioned on actions [5, 10, 8, 23, 35]. However, most existing works focus on egocentric settings, where actions primarily correspond to navigation,"},{"citing_arxiv_id":"2605.08412","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-08T19:20:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749-3761, 2022. [9] James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, and Romain Mueller. Lift-attend-splat: Bird's-eye-view camera-lidar fusion using transformers. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4526-4536, 2024. [10] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018. [11] Michael D Kirchhoff and Julian Kiverstein. Extended consciousness and predictive processing: A third wave view. Routledge, 2019. [12] Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1-62, 2022."},{"citing_arxiv_id":"2605.08019","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners","primary_cat":"cs.AI","submitted_at":"2026-05-08T17:07:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"theory-based modeling, exploration, and planning. arXiv preprint arXiv:2107.12544, 2021. [5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529-533, 2015. [6] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3), 2018. [7] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. [8] Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, and Yang Gao. 
Efficientzero v2: Mastering discrete and continuous control with limited data."},{"citing_arxiv_id":"2605.07278","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Predictive but Not Plannable: RC-aux for Latent World Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:43:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"06088, 2019. [14] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent - a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271-21284, 2020. [15] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018. [16] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
[17] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and"},{"citing_arxiv_id":"2605.07199","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention","primary_cat":"cs.AI","submitted_at":"2026-05-08T03:47:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A DBM-based architecture learns consumer beliefs to enable consistent prediction and counterfactual inference for marketing interventions, outperforming baselines on heterogeneous treatment effects in simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"text sequences, a paradigm that lacks explicit mechanisms for causal reasoning or internal world representation [7]. In contrast to such token-level prediction, world models, i.e., systems that learn an internal representation of environment dynamics to enable prediction, planning, and reasoning, have attracted growing attention as a complementary paradigm [8, 7]. Recent theoretical proposals [7] and empirical advances [8, 9, 10] have positioned world models as a promising path beyond the limitations of autoregressive generation.
This view is echoed in broader discourse: for example, Yann LeCun has described autoregressive LLMs as \"doomed\" due to their inability to model causality 1, and Demis Hassabis has emphasized"},{"citing_arxiv_id":"2605.07079","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Visual Feature-Based World Models via Residual Latent Action","primary_cat":"cs.CV","submitted_at":"2026-05-08T00:58:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"aspect of RLA is that adding one linear layer to predict z makes the simple framework (Fig. 4) a true world action model, as RLA enables accurate prediction of s_{t+h} from s_t. 4.3 Visual Reinforcement Learning within RLA World Model. Motivation. Extracting a robust visuomotor policy from a world model trained on offline videos remains a fundamental challenge [58]. Existing methods generally fall into two paradigms: reinforcement learning (RL) and planning. In RL, GWM [13] improves sample efficiency but still requires simulator interaction. UniSim [2] uses a video diffusion model as a simulator, but requires massive compute, remains unreleased, and is evaluated on a single task.
Conversely, planning with world"},{"citing_arxiv_id":"2605.06500","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Operator-Guided Invariance Learning for Continuous Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:18:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06388","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models","primary_cat":"cs.CV","submitted_at":"2026-05-07T15:05:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent space as stronger foundation for policy-relevant robotics diffusion world models. 1 Introduction. Action-conditioned video world models are emerging as a practical interface between generative modeling and robotics [20, 70, 10]. Given observation and action histories, they predict future observations and serve as learned proxies for robot-environment interaction when handcrafted simulators are difficult to build [58, 15].
Recent works show that such models can support policy evaluation with good correlation to real-world outcomes [62], and policy improvement [82, 75, 52]."},{"citing_arxiv_id":"2605.06298","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement","primary_cat":"cs.CV","submitted_at":"2026-05-07T14:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"periods per unit length. The standard Fourier feature encoding assigns frequency ξ_k = 2^{k−1} to band k, as in [Tancik et al., 2020]: γ(y)_k = (sin(2^k π y), cos(2^k π y)), k ∈ {0, 1, ..., K−1}, (7) or in full, γ(y) = (sin(2^0 π y), cos(2^0 π y), ..., sin(2^{K−1} π y), cos(2^{K−1} π y))^⊤ ∈ R^{2K}. (8) The no-aliasing condition ξ_k < ξ_{Nyquist} requires: 2^{k−1} < H/4 ⟹ k_max < log_2(H) − 1. (9) For MiniGrid (H = 72), k_max < 5.17; therefore, all K = 6 bands remain within the safe manifold. However, for WeatherBench (H = 32), k_max < 4, meaning that the highest-frequency bands must be eliminated. Footnote 12: For our analysis, we consider aliasing along a single axis y for clarity, as the two spatial dimensions are separable.
"},{"citing_arxiv_id":"2605.06247","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:26:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}