{"total":37,"items":[{"citing_arxiv_id":"2606.28757","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models","primary_cat":"cs.CV","submitted_at":"2026-06-27T06:13:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CrashTwin is a new benchmark framework that exposes physical violations in state-of-the-art world models during multi-agent collisions despite high visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22918","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation","primary_cat":"cs.CV","submitted_at":"2026-06-22T06:58:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20545","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Current World Models Lack a Persistent State Core","primary_cat":"cs.CV","submitted_at":"2026-06-18T17:55:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 
videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18943","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physics-IQ Verified","primary_cat":"cs.CV","submitted_at":"2026-06-17T11:23:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04811","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?","primary_cat":"cs.CV","submitted_at":"2026-06-03T12:35:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04737","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment","primary_cat":"cs.CV","submitted_at":"2026-06-03T11:20:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving 
visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01538","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics","primary_cat":"cs.GR","submitted_at":"2026-06-01T01:36:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Assembles MPM simulation dataset and compares code generation versus video diffusion for inferring physical parameters and extrapolating dynamics from videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00793","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MBench: A Comprehensive Benchmark on Memory Capability for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-30T16:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30542","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physically Viable World Models: A Case for Query-Conditioned Embodied AI","primary_cat":"cs.AI","submitted_at":"2026-05-28T20:18:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention 
queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30346","ref_index":81,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"YoCausal: How Far is Video Generation from World Model? A Causality Perspective","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25874","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation","primary_cat":"cs.CV","submitted_at":"2026-05-25T14:01:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24962","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tempered Self-Similarity Alignment for Physically Plausible Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-24T09:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23699","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T14:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19242","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyWorld: Physics-Faithful World Model for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T01:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18396","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NEWTON: Agentic Planning for Physically Grounded Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T13:42:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 
without altering the base generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15185","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Quantitative Video World Model Evaluation for Geometric-Consistency","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15116","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DriveCtrl: Conditioned Sim-to-Real Driving Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14843","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MechVerse: Evaluating Physical Motion Consistency in Video Generation Models","primary_cat":"cs.CV","submitted_at":"2026-05-14T13:48:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity 
increases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14269","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:12:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than prior 2D rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":213,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Physical Commonsense VideoPhy [198], PhyGenBench [199], VBench-2.0 [200], WorldModelBench [201] Physics-IQ [202], WorldScore [203], EWMBench [204] Action Plausibility WorldSimBench [205], Wow, wo, val! 
[206] Action Policy General MetaWorld [207], RLBench [208], Robomimic [209], Franka Kitchen [210], ManiSkill [211] ManiSkill2 [151], ManiSkill3 [212], RoboCasa [152], CALVIN [213], VIMAbench [214] VLMbench [215], LIBERO [216], Libero-plus [4], Libero-pro [217], Libero-X [218] COLOSSEUM [219], AGNOSTOS [220], RoboEval [221], RoboVerse [222], PolaRiS [223] RoboMME [224], GenManip [225], VLABench [226], RoboSuite [227], RoboLab [228] SimplerEnv [229], ARNOLD [230], GemBench [231] Bimanual and Humanoid Form RoboTwin [153], BiGym [232], HumanoidBench [233]"},{"citing_arxiv_id":"2605.10806","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhyGround: Benchmarking Physical Reasoning in Generative World Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T16:30:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"All benchmark prompts, human annotations, model checkpoints, and evaluation code will be publicly released. 2 Related Work Physics-focused Benchmarks for Video Generation. General-purpose video benchmarks such as EvalCrafter [25] and VBench [20] rely on FVD, SSIM, and CLIP-style metrics that measure visual fidelity rather than physical correctness. Recently, a growing number of physics-focused benchmarks have emerged to address this gap [26, 44, 12, 47, 48, 45, 6, 37, 46, 36]. VideoPhy [3] and VideoPhy-2 [4] use a binary followed/violated rubric that collapses mild and severe violations, with about ten annotators. PhyGenBench [26] curates only 160 manually crafted prompts and assumes each video corresponds to only one physical law. 
Physics-IQ [28] grades 66 scenarios with reference-based pixel metrics that presume a unique ground-truth continuation and do not localize per-law"},{"citing_arxiv_id":"2605.10434","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and aesthetics/compositionality suites [7, 30, 12, 13, 19], none of which provide structured reasoning verification. Reasoning-oriented benchmarks each cover one slice: embodied task-success [18], small-scale answer-verifiable puzzles [14, 11], procedural process-aware tasks [10], single-event causality with Likert ratings [25], physical-law or rule-governed transitions [16, 5], and video understanding rather than generation [24]. VLM-as-Judge pipelines [31, 15, 4] scale evaluation but single-pass judges over-reward visual plausibility and miss process-level errors. 
WorldReasonBench instead pairs an initial image with a text instruction to probe open-domain future-state evolution, annotates each case with 5-7 QA pairs across four reasoning phases (state, process, fidelity, mecha-"},{"citing_arxiv_id":"2605.07061","ref_index":28,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Joint Audio-Video Generation Models Understand Physics?","primary_cat":"cs.SD","submitted_at":"2026-05-08T00:14:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"maintaining physically consistent audio-visual behavior is essential. Evaluating joint audio-video generation, therefore, requires a benchmark that goes beyond perceptual quality and semantic alignment to test whether audio, video, and their interaction remain physically consistent. Existing benchmarks have made important progress on related aspects. PhysicsIQ [28], PhyGenBench [27], VideoPhy-2 [3], and PhyWorldBench [12] focus on physical realism in video, while PhyAVBench [39] examines whether generated audio responds appropriately to controlled changes in material, force, and environment. 
TAVGBench [26], JavisBench [22], VABench [17], and SAVGBench [31] instead evaluate semantic, temporal, and spatial alignment in joint audio-video"},{"citing_arxiv_id":"2604.19193","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Consequently, generation and editing benchmarks [21,29,55,60] established aesthetic and instruction-following frameworks. Although recent works [32,87] support reference-guided editing, they are often limited to single-shot scenarios and lack cross-shot consistency. More recently, the field has gravitated toward physical and spatial grounding. Benchmarks such as PhyGenBench [48], RBench-V [17], and SpatialViz-Bench [69] evaluate physical commonsense and 3D spatial relationships. Despite these advances, existing frameworks typically evaluate tasks in isolation, lacking a unified paradigm for multimodal context video generation. 
2.3 Video Evaluation Methods Existing video generation evaluation methodologies can be broadly categorized"},{"citing_arxiv_id":"2604.15299","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AnimationBench: Are Video Models Good at Character-Centric Animation?","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07990","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations","primary_cat":"cs.CV","submitted_at":"2026-04-09T08:59:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[33] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10486-10496, 2025. 
2, 3, 4, 5, 8 [34] Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024. 3 [35] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam"},{"citing_arxiv_id":"2603.21743","ref_index":31,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-03-23T09:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CellFluxRL post-trains the CellFlux model with RL using seven biological reward functions to generate more biologically valid virtual cell images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13294","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-02-09T05:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13609","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Do-Undo Bench: Reversibility for Action Understanding in Image 
Generation","primary_cat":"cs.CV","submitted_at":"2025-12-15T18:03:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Do-Undo Bench is a new evaluation task and dataset that forces models to simulate forward action effects and then undo them to measure genuine action understanding in image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13281","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?","primary_cat":"cs.CV","submitted_at":"2025-12-15T12:41:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01843","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T16:28:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.21002","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SURF: Signature-Retained Fast Video 
Generation","primary_cat":"cs.GR","submitted_at":"2025-11-25T18:54:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.20206","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2025-10-23T04:45:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24702","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility","primary_cat":"cs.CV","submitted_at":"2025-09-29T12:32:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving 
photorealism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.05635","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-08-07T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.13211","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MAGI-1: Autoregressive Video Generation at Scale","primary_cat":"cs.CV","submitted_at":"2025-05-19T14:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In practice, the smallest s utilized is 1/64, corresponding to the standard flow-matching inference setting that requires 64 function evaluations. When training with this minimal step size, we incorporate classifier-free guidance (CFG) distillation (Meng et al., 2023) (see Sec. 2.4.1 for details). The step size s for distillation is cyclically sampled from the set [1/64] × 8 ∪ [1/32, 1/16, 1/8]. 
This sampling strategy enables a single distilled model to perform denoising with different computational budgets (64, 32, 16, or 8 steps), thus providing flexibility to dynamically balance generation quality and inference efficiency at test time. 2.4 Inference Approach 2.4.1 Diffusion Guidance Classifier-free guidance (Ho & Salimans, 2022), a widely adopted low-temperature sampling"},{"citing_arxiv_id":"2503.21755","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:57:01+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Cons), adherence to simple prompts (Simp Pmpt), compositional creativity (Comp Crea), commonsense reasoning (Com Sense), physics-based realism (Phy), human anatomy (Human Anat), and adherence to complex prompts (Cplx Pmpt). Superficial Faithfulness Intrinsic Faithfulness Frame Wise Temp Cons Simp Pmpt Comp Crea Com Sense Phy Human Anat Cplx Pmpt VBench [25], [26] ✓ ✓ ✓ T2V-CompBench [73] ✓ ✓ ✓ PhyGenBench [72] ✓ StoryEval [74] ✓ ✓ VBench-2.0 (Ours) ✓ ✓ ✓ ✓ ✓ ✓ way for the development of next-generation video foundation models [20], [29]-[31], [59], [61]-[66] that achieve remarkable visual quality and robust spatiotemporal coherence. They have shifted focus toward enhancing video generation adhere to deeper principles such as physical laws and commonsense"}],"limit":50,"offset":0}