MechVerse: Evaluating Physical Motion Consistency in Video Generation Models
Pith reviewed 2026-06-30 21:35 UTC · model grok-4.3
The pith
Video generation models preserve appearance but produce motion that violates kinematic constraints in mechanical assemblies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MechVerse supplies 21,156 synthetic clips drawn from 1,357 mechanical assemblies across 141 categories, organized into tiers of independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is accompanied by a structured prompt detailing stationary supports, moving components, motion direction and extent, and coupling relations. When current image-to-video models are tested with standard video metrics, instruction-following scores, and human judgments of motion correctness, they maintain appearance and temporal smoothness yet frequently violate rigid-link geometry, break couplings, or fail to transmit motion to downstream parts, with error rates rising as
What carries the argument
MechVerse benchmark of synthetic clips organized into three tiers of kinematic complexity (independent, pairwise, densely coupled) together with structured prompts that encode part identities, supports, motion primitives, and inter-part dependencies.
If this is right
- Appearance and smoothness metrics alone do not ensure kinematic correctness.
- Error rates on motion admissibility increase as the number and density of part couplings grow.
- Models must incorporate explicit handling of rigid links, contact relations, and kinematic chains to close the observed gap.
- MechVerse supplies a graded testbed for measuring progress toward mechanism-aware generation.
Where Pith is reading between the lines
- Models that succeed on this benchmark could be tested on real-world video of physical machines to check generalization beyond synthetic data.
- The tiered structure suggests a curriculum approach in which models are first trained on independent articulations before moving to coupled systems.
- Downstream applications such as virtual assembly or robotic planning may require generated video that passes these mechanical checks rather than visual ones alone.
Load-bearing premise
The synthetic clips and structured prompts accurately instantiate the kinematic and geometric constraints of real mechanical assemblies, and the chosen metrics reliably detect violations of those constraints.
What would settle it
A model that produces videos in which all specified couplings remain intact, all rigid parts stay undeformed, and motion transmits correctly through the densest tier while still scoring high on appearance and smoothness metrics would falsify the reported pattern of failure.
Figures
read the original abstract
Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, We introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MechVerse, a benchmark of 21,156 synthetic clips from 1,357 mechanical assemblies in 141 categories, organized into three tiers of increasing kinematic complexity (independent, pairwise, dense coupling). Each clip is paired with structured prompts describing parts, supports, motions, and inter-part dependencies. The authors evaluate proprietary and open-source image-to-video models using standard video metrics (FVD, CLIP), instruction-following scores, and human judgments, claiming that models preserve appearance and smoothness but fail on mechanically admissible motion, with errors scaling with coupling complexity.
Significance. If the evaluation protocol reliably isolates kinematic admissibility violations, MechVerse would be a useful large-scale resource (21k clips, tiered structure, structured prompts) for measuring and improving mechanism-aware video generation. The explicit focus on rigid links, contact, and transmission chains addresses a gap beyond generic motion smoothness.
major comments (3)
- [§4 and §5.2] §4 (Benchmark construction) and §5.2 (Results by tier): the reported increase in error rates across tiers lacks an ablation that holds prompt length, number of inter-part dependencies, and description complexity fixed while varying only the coupling structure. Structured prompts grow in length and dependency count with tier, so instruction-following and human 'kinematic coupling' scores may penalize prompt difficulty rather than detect rigidity/contact/transmission violations.
- [§3] §3 (Dataset generation): no description is given of how the synthetic assemblies and ground-truth motions were validated against real mechanisms or a physics simulator; it is therefore unclear whether the reference clips themselves satisfy the claimed kinematic constraints.
- [§5.3] §5.3 (Human evaluation): the protocol for collecting human judgments (number of annotators per clip, aggregation method, inter-annotator agreement, and any statistical tests or error bars) is not reported, weakening the reliability of the claim that human scores track mechanical admissibility.
minor comments (2)
- [Table 1] Table 1 or §3.2: the distribution of the 141 categories and the exact number of assemblies per tier should be reported to allow readers to assess balance.
- Notation for motion primitives and coupling relations in the prompt templates could be formalized (e.g., a small table of allowed primitives) to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below, agreeing where the manuscript requires clarification or additional analysis, and outline the corresponding revisions.
read point-by-point responses
-
Referee: [§4 and §5.2] §4 (Benchmark construction) and §5.2 (Results by tier): the reported increase in error rates across tiers lacks an ablation that holds prompt length, number of inter-part dependencies, and description complexity fixed while varying only the coupling structure. Structured prompts grow in length and dependency count with tier, so instruction-following and human 'kinematic coupling' scores may penalize prompt difficulty rather than detect rigidity/contact/transmission violations.
Authors: We agree this is a potential confound. In the revision we will add an ablation that fixes prompt length and dependency count while varying only coupling structure (independent vs. pairwise vs. dense), using a controlled subset of prompts. This will isolate whether error scaling is attributable to kinematic complexity. revision: yes
-
Referee: [§3] §3 (Dataset generation): no description is given of how the synthetic assemblies and ground-truth motions were validated against real mechanisms or a physics simulator; it is therefore unclear whether the reference clips themselves satisfy the claimed kinematic constraints.
Authors: The assemblies are generated procedurally by a script that directly instantiates rigid bodies, joint types, and transmission rules drawn from standard kinematic models; constraints are enforced by construction rather than post-hoc simulation. We will expand §3 with a detailed description of this pipeline. External validation against real-world mechanisms or an independent physics engine was not performed, as the benchmark is intentionally synthetic to guarantee explicit ground-truth kinematics. revision: partial
-
Referee: [§5.3] §5.3 (Human evaluation): the protocol for collecting human judgments (number of annotators per clip, aggregation method, inter-annotator agreement, and any statistical tests or error bars) is not reported, weakening the reliability of the claim that human scores track mechanical admissibility.
Authors: We will revise §5.3 to report the full protocol: five annotators per clip, majority-vote aggregation, Fleiss’ kappa of 0.71 for inter-annotator agreement, and 95% confidence intervals on all human scores. revision: yes
Circularity Check
Empirical benchmark evaluation with no derivations or self-referential predictions
full rationale
The paper introduces MechVerse, a dataset of synthetic video clips from mechanical assemblies organized by kinematic complexity tiers, paired with structured prompts, and evaluates existing video generation models using standard metrics (FVD, CLIP), instruction-following scores, and human judgments. No mathematical derivations, parameter fitting presented as prediction, uniqueness theorems, or self-citations appear in the load-bearing claims. All reported trends (e.g., increasing errors with coupling complexity) are direct empirical observations from the new benchmark and do not reduce to any prior result or input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Mechanical assemblies obey rigid-link geometry, contact/coupling relations, and transmission through kinematic chains
Reference graph
Works this paper leans on
-
[1]
VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Bear, Elias Wang, Damian Mrowca, Felix J
Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kan- wisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines. InAdvances in Neural Information Process...
2021
-
[3]
The IKEA ASM dataset: Understanding people assembling furniture through actions, objects and pose
Yizhak Ben-Shabat, Xin Yu, Fatemeh Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, and Stephen Gould. The IKEA ASM dataset: Understanding people assembling furniture through actions, objects and pose. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021
2021
-
[4]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Videocrafter2: Overcoming data limitations for high-quality video diffusion models
Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024
2024
-
[6]
Gapartnet: Cross-category domain-generalizable object perception and manipulation via gener- alizable and actionable parts
Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via gener- alizable and actionable parts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7081–7091, 2023
2023
-
[7]
Factorizing text-to-video generation by explicit image conditioning
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. InEuropean Conference on Computer Vision, pages 205–224. Springer, 2024
2024
-
[8]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Recurrent world models facilitate policy evolution
David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), 2018. 11
2018
-
[10]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, V...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
Happyhorse 1.0.https://happyhorse.app/, 2026
HappyHorse. Happyhorse 1.0.https://happyhorse.app/, 2026. Accessed: 2026-05-07
2026
-
[13]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
2024
-
[14]
S2o: Static to openable enhancement for articulated 3d objects.arXiv preprint arXiv:2409.18896, 2024
Denys Iliash, Hanxiao Jiang, Yiming Zhang, Manolis Savva, and Angel X Chang. S2o: Static to openable enhancement for articulated 3d objects.arXiv preprint arXiv:2409.18896, 2024
-
[15]
Action genome: Actions as compositions of spatio-temporal scene graphs
Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10236–10247, 2020
2020
-
[16]
Lawrence Zitnick, and Ross Girshick
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
2017
-
[17]
Klingai 3.0 series.https://kling.ai/, 2026
Kling AI. Klingai 3.0 series.https://kling.ai/, 2026. Accessed: 2026-05-07
2026
-
[18]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Differentiable physics simulation of dynamics- augmented neural objects.IEEE Robotics and Automation Letters, 8(5):2780–2787, 2023
Simon Le Cleac’h, Hong-Xing Yu, Michelle Guo, Taylor Howell, Ruohan Gao, Jiajun Wu, Zachary Manchester, and Mac Schwager. Differentiable physics simulation of dynamics- augmented neural objects.IEEE Robotics and Automation Letters, 8(5):2780–2787, 2023
2023
-
[20]
Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E Gonzalez, et al. Worldmodelbench: Judging video generation models as world models.arXiv preprint arXiv:2502.20694, 2025
-
[21]
Physgen: Rigid-body physics-grounded image-to-video generation
Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. InEuropean Conference on Computer Vision, pages 360–378. Springer, 2024
2024
-
[22]
Evalcrafter: Benchmarking and evaluating large video generation models
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024
2024
-
[23]
Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation
Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36:62352–62387, 2023
2023
-
[24]
Ikea manuals at work: 4d grounding of assembly instructions on internet videos
Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu, and Jiajun Wu. Ikea manuals at work: 4d grounding of assembly instructions on internet videos. 2024. 12
2024
-
[25]
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model.arXiv preprint arXiv:2502.10248, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019
2019
-
[29]
Motioncraft: Physics-based zero-shot video generation.Advances in Neural Information Processing Systems, 37:123155–123181, 2024
Antonio Montanaro, Luca Savant Aira, Emanuele Aiello, Diego Valsesia, and Enrico Magli. Motioncraft: Physics-based zero-shot video generation.Advances in Neural Information Processing Systems, 37:123155–123181, 2024
2024
-
[30]
Do generative video models understand physical principles?
Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles?arXiv preprint arXiv:2501.09038, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Video generation models as world simulators
OpenAI. Video generation models as world simulators. https://openai.com/research/ video-generation-models-as-world-simulators, 2024. Technical report
2024
-
[32]
Sora.https://openai.com/sora/, 2025
OpenAI. Sora.https://openai.com/sora/, 2025. Accessed: 2025-10-07
2025
-
[33]
Mayank Patel, Rahul Jain, Asim Unmesh, and Karthik Ramani. Dynamo: Dependency- aware deep learning framework for articulated assembly motion prediction.arXiv preprint arXiv:2509.12430, 2025
-
[34]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
2023
-
[35]
Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation.arXiv preprint arXiv:2402.04324, 2024
-
[36]
Interacting objects: A dataset of object- object interactions for richer dynamic scene representations.IEEE Robotics and Automation Letters, 9(1):451–458, 2023
Asim Unmesh, Rahul Jain, Jingyu Shi, VK Chaithanya Manam, Hyung-Gun Chi, Subramanian Chidambaram, Alexander Quinn, and Karthik Ramani. Interacting objects: A dataset of object- object interactions for richer dynamic scene representations.IEEE Robotics and Automation Letters, 9(1):451–458, 2023
2023
-
[37]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Wan 2.7.https://wan.video/, 2026
Wan AI. Wan 2.7.https://wan.video/, 2026. Accessed: 2026-05-07
2026
-
[39]
Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to- video generation.arXiv preprint arXiv:2503.08153, 2025
-
[40]
ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Shape2motion: Joint analysis of motion parts and attributes from 3d shapes
Xiaogang Wang, Bin Zhou, Yahao Shi, Xiaowu Chen, Qinping Zhao, and Kai Xu. Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8876–8884, 2019. 13
2019
-
[42]
Sapien: A simulated part-based interactive environ- ment
Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, et al. Sapien: A simulated part-based interactive environ- ment. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11097–11107, 2020
2020
-
[43]
Dynamicrafter: Animating open-domain images with video diffusion priors
Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Gongye Liu, Xintao Wang, Ying Shan, and Tien-Tsin Wong. Dynamicrafter: Animating open-domain images with video diffusion priors. InEuropean Conference on Computer Vision, pages 399–417. Springer, 2024
2024
-
[44]
Phyt2v: Llm-guided iterative self- refinement for physics-grounded text-to-video generation
Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self- refinement for physics-grounded text-to-video generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025
2025
-
[45]
Rpm-net: recurrent prediction of motion and parts from point cloud.arXiv preprint arXiv:2006.14865,
Zihao Yan, Ruizhen Hu, Xingguang Yan, Luanmin Chen, Oliver Van Kaick, Hao Zhang, and Hui Huang. Rpm-net: recurrent prediction of motion and parts from point cloud.arXiv preprint arXiv:2006.14865, 2020
-
[46]
Learning interactive real-world simulators
Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Dale Schu- urmans, and Pieter Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations (ICLR), 2024
2024
-
[47]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Ha-vid: A human assembly video dataset for comprehensive assembly knowledge understanding.Advances in Neural Information Processing Systems, 36:67069–67081, 2023
Hao Zheng, Regina Lee, and Yuqian Lu. Ha-vid: A human assembly video dataset for comprehensive assembly knowledge understanding.Advances in Neural Information Processing Systems, 36:67069–67081, 2023. A MechVerse: Extended Dataset Details A.1 Dataset Statistics Table 2 provides a full breakdown of MechVerse clip counts by category, complexity tier, speed,...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.