{"total":12,"items":[{"citing_arxiv_id":"2606.02564","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization","primary_cat":"cs.CV","submitted_at":"2026-06-01T17:54:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24962","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tempered Self-Similarity Alignment for Physically Plausible Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-24T09:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23878","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-22T17:34:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in 
video diffusion models like CogVideoX.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23345","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T08:06:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12138","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1, 2, 3 [62] Haohan Wang, Wei Feng, Yaoyu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, and Jingping Shao. Generate e-commerce product background by integrating category commonality and personalized style. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1-5. IEEE, 2025. 
2 [63] Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153, 2025. 4 [64] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah"},{"citing_arxiv_id":"2604.18564","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Beyond text-to-video synthesis [3,21,41,46,60], interactive video generation [63] that responds to interactive control signals has evolved rapidly. Existing models incorporate various signals like camera controls [13,36,52,65] and action controls [7,10,12,35] to simulate future states. Recent studies have explored several essential properties [16] of interactive video world models, such as physical consistency [37,47,49,72], and long-horizon coherence [53,56,62], alongside efficient real-time generation [17,61,66,75] to enable practical deployment. With these properties, world models can serve as powerful simulators for downstream tasks like game generation [39,55], embodied AI [6,27], and autonomous driving [33,58]. 
Game video world models [44,64] control the environment and simu-"},{"citing_arxiv_id":"2604.11804","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-13T17:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"universal control for diffusion transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940-14950, 2025. 15 [54] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. [55] Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153, 2025. [56] Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, and Shujun Wang. 
Language model based text-to-audio generation: Anti-causally aligned collaborative residual transformers."},{"citing_arxiv_id":"2604.09415","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhysInOne: Visual Physics Learning and Reasoning in One Suite","primary_cat":"cs.CV","submitted_at":"2026-04-10T15:27:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Datasets for Testing Physical Understanding: Recent advances of large language, vision, multimodal, video, and world models have led to a series of benchmarking datasets to test the physical understanding abilities of these models, including ComPhy [18], PerceptionTest [69], TraySim [19], GRASP [44], PhyBench [61], Physics-IQ [63], Morpheus [97], WorldModelBench [51], WISA [84], VBench-2.0 [102], PhysBench [20], DynSuperCLEVR [85], VideoPhy [6], VideoPhy-2 [7], PhyX [76], IntPhys2 [12], PhyGenBench [62], UGPhysics [89], STI-Bench [56], PhyWorldBench [37], PisaBench [50], NewtonGen [94], and NewtonBench-60K [49]. 
These benchmarks typically focus on text-image-video question answering (QA) or visual understanding and generation in narrow scenarios, and thus"},{"citing_arxiv_id":"2604.07348","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MoRight: Motion Control Done Right","primary_cat":"cs.CV","submitted_at":"2026-04-08T17:59:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Commonsense (PC) and Semantic Adherence (SA) from VideoPhy [4], both 5-point scores normalized to [0,1]. All evaluations are conducted at 480p resolution. Evaluation Datasets. We evaluate on three datasets spanning diverse interaction scenarios. DynPose-100K [61] is an in-the-wild dataset with highly dynamic camera motion; we manually select 50 videos exhibiting strong viewpoint changes and clear object interactions. WISA [69] is a large-scale physical-dynamics dataset; we select 50 videos from categories including collision, deformation, elasticity, liquid, and rigid-body motion. We further collect 50 real-world cooking videos, featuring complex hand-object interactions. 
forward reasoning, inverse reasoning, zoom-in, orbit-down, orbit-right"},{"citing_arxiv_id":"2603.09283","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Ideal to Real: Stable Video Object Removal under Imperfect Conditions","primary_cat":"cs.CV","submitted_at":"2026-03-10T07:07:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SVOR achieves stable, shadow-free video object removal under real-world imperfections via MUSE mask handling, DA-Seg localization, and curriculum training on real and synthetic data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01843","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T16:28:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.24702","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility","primary_cat":"cs.CV","submitted_at":"2025-09-29T12:32:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation 
while preserving photorealism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}