Kairos: A Native World Model Stack for Physical AI
Pith reviewed 2026-06-27 03:59 UTC · model grok-4.3
The pith
Kairos introduces a world model stack that learns from mixed embodiment data and maintains states over long horizons with mathematically bounded error via hybrid attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kairos pioneers a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum that sequences heterogeneous experience into a developmental pathway. It maintains the world through a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention handles local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention sustains global memory. Formal theoretical bounds demonstrate that this temporal factorization strictly limits error accumulation and mathematically guarantees state propagation across extended horizons. Deployment-aware system co-design enables low-latency rollout on server and consumer-grade hardware.
What carries the argument
Hybrid Linear Temporal Attention that combines sliding-window attention for local dynamics, dilated sliding windows for mid-range dependencies, and gated linear attention for persistent global memory, carrying the theoretical bounds on error accumulation.
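To make the three temporal scales concrete, the following is a minimal pure-Python sketch of how a sliding window, a dilated window, and a gated linear recurrence could be combined. It is not the paper's implementation: the function names, the `window`, `taps`, and `dilation` parameters, and the equal-weight averaging of the three branches are all assumptions made for illustration.

```python
import math

def sliding_window_keys(t, window):
    """Indices a causal sliding window of size `window` lets step t attend to."""
    return list(range(max(0, t - window + 1), t + 1))

def dilated_window_keys(t, taps, dilation):
    """Indices t, t-d, t-2d, ... (at most `taps` of them), in ascending order."""
    return [t - n * dilation for n in reversed(range(taps)) if t - n * dilation >= 0]

def softmax_attend(q_t, keys, values, idx):
    """Softmax attention of one query over the selected key/value positions."""
    scale = 1.0 / math.sqrt(len(q_t))
    scores = [scale * sum(a * b for a, b in zip(q_t, keys[j])) for j in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[j][c] for w, j in zip(weights, idx)) / total for c in range(dim)]

def gated_linear_attention(queries, keys, values, gates):
    """Recurrence S_t = g_t * S_{t-1} + k_t v_t^T with readout o_t = q_t^T S_t."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, g_t in zip(queries, keys, values, gates):
        state = [[g_t * state[a][c] + k_t[a] * v_t[c] for c in range(d_v)] for a in range(d_k)]
        outputs.append([sum(q_t[a] * state[a][c] for a in range(d_k)) for c in range(d_v)])
    return outputs

def hybrid_attention(queries, keys, values, gates, window=4, taps=4, dilation=4):
    """Average the three branches; the equal weighting is a placeholder assumption."""
    global_out = gated_linear_attention(queries, keys, values, gates)
    mixed = []
    for t, q_t in enumerate(queries):
        local = softmax_attend(q_t, keys, values, sliding_window_keys(t, window))
        mid = softmax_attend(q_t, keys, values, dilated_window_keys(t, taps, dilation))
        mixed.append([(a + b + c) / 3.0 for a, b, c in zip(local, mid, global_out[t])])
    return mixed
```

The two window branches cost a fixed number of key lookups per step, and the gated recurrence carries a fixed-size state, so per-step cost in this sketch does not grow with sequence length. How the paper actually fuses the branches is not stated in the material above.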
If this is right
- Enables low-latency observation-action-feedback loops on consumer-grade hardware.
- Organizes open-world videos, human data, and robot interactions into a single progressive training pathway.
- Delivers top-level results on embodied world-model and long-horizon benchmarks while preserving efficiency.
- Supplies mathematical guarantees for state propagation that support extended physical AI operation.
- Forms an operational foundation for future self-evolving physical intelligence systems.
Where Pith is reading between the lines
- The error-bound approach could reduce reliance on frequent model resets or heavy retraining in deployed robotic systems.
- Integration with existing reinforcement learning loops might improve sample efficiency in policy learning without extra compute scaling.
- Testing the curriculum on additional data modalities could reveal whether the bounds hold when embodiment gaps widen further.
Load-bearing premise
The Hybrid Linear Temporal Attention mechanism with sliding-window, dilated, and gated linear components produces the claimed strict limit on error accumulation and state propagation guarantees.
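For orientation, here is the generic shape such a guarantee usually takes. This is a textbook contraction argument written under stated assumptions, not the paper's theorem: the symbols e_t (state error at step t), γ (a contraction factor, for example an upper bound on the gate), and ε (new error injected per step) are hypothetical and do not come from the source.

```latex
% Illustrative only: e_t, \gamma, and \varepsilon are assumed symbols, not the paper's notation.
% Assumption: the one-step update contracts existing error by a factor \gamma with
% 0 \le \gamma < 1 and injects at most \varepsilon of new error per step.
\begin{align}
e_{t+1} &\le \gamma\, e_t + \varepsilon, \\
e_T &\le \gamma^{T} e_0 + \varepsilon \sum_{i=0}^{T-1} \gamma^{i}
      = \gamma^{T} e_0 + \varepsilon\,\frac{1-\gamma^{T}}{1-\gamma}
      \le e_0 + \frac{\varepsilon}{1-\gamma}.
\end{align}
```

Under these assumptions the error stays below a constant that does not depend on the horizon T. If γ = 1 the same unrolling gives e_0 + Tε, which grows linearly, so the premise stands or falls on whether the paper's assumptions really deliver a factor strictly below one.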
What would settle it
A controlled long-horizon rollout test on an embodied benchmark that measures whether prediction error exceeds the theoretical bound derived from the temporal factorization.
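A minimal sketch of how such a check could be scripted is shown below. The paper's actual bound is not reproduced in the material above, so this example substitutes a geometric bound of the form e_t ≤ γ^t e_0 + ε(1 − γ^t)/(1 − γ) as a stand-in. The function names, the parameters `gamma` and `eps`, and the error sequences are all hypothetical.

```python
def geometric_bound(e0, gamma, eps, t):
    """Closed form of e_t <= gamma * e_{t-1} + eps unrolled from e_0 (needs 0 <= gamma < 1)."""
    return gamma ** t * e0 + eps * (1.0 - gamma ** t) / (1.0 - gamma)

def first_violation(errors, gamma, eps, tol=1e-9):
    """Return the first step whose measured error exceeds the bound, or None."""
    e0 = errors[0]
    for t, err in enumerate(errors):
        if err > geometric_bound(e0, gamma, eps, t) + tol:
            return t
    return None

# Hypothetical per-step prediction errors from two rollouts (made-up numbers).
bounded_rollout = [0.10, 0.12, 0.13, 0.135, 0.14]
drifting_rollout = [0.10, 0.20, 0.40, 0.80, 1.60]

print(first_violation(bounded_rollout, gamma=0.5, eps=0.1))   # None
print(first_violation(drifting_rollout, gamma=0.5, eps=0.1))  # 1
```

A single violating step on a real benchmark would be enough to show that either the bound or one of its assumptions does not hold in practice; the absence of violations would only be supporting evidence.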
Original abstract
World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unified world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Kairos, a native world model stack for Physical AI. It features (1) a Cross-Embodiment Data Curriculum for native pre-training on open-world videos, human data, and robot interactions; (2) a Native Unified Architecture with Hybrid Linear Temporal Attention (sliding-window for local dynamics, dilated windows for mid-range, gated linear for global memory) that claims formal theoretical bounds strictly limiting error accumulation and guaranteeing state propagation over long horizons; and (3) Deployment-Aware System Co-Design for low-latency rollouts. Experiments claim top-level performance with strong efficiency trade-offs on embodied world-model, long-horizon, and action-policy benchmarks.
Significance. If the formal bounds are rigorously derived and the benchmark results hold with proper controls and baselines, the work would be significant for providing an integrated, deployment-ready foundation for physical AI that addresses long-horizon state maintenance and efficiency, moving beyond passive visual world models.
Major comments (2)
- [Abstract] Abstract: the assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and 'mathematically guarantees state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how sliding-window + dilated + gated linear attention produces these properties rather than standard attention; this is load-bearing for the central theoretical claim.
- [Abstract] Abstract (and Experiments section): claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, preventing assessment of whether superiority is demonstrated or if results reduce to unstated fitting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
- Referee: [Abstract] Abstract: the assertion of 'formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation' and 'mathematically guarantees state propagation across extended horizons' is made without any equations, proof sketch, assumptions (e.g., bounded state norms, properties of the gating function), or derivation steps showing how sliding-window + dilated + gated linear attention produces these properties rather than standard attention; this is load-bearing for the central theoretical claim.
  Authors: We agree the abstract states the claim at a high level without derivation details. The full manuscript contains the rigorous derivation in Section 4, including assumptions (bounded state norms, Lipschitz properties of the gating function), the error-bound proof for the hybrid factorization versus standard attention, and the state-propagation guarantee over long horizons. To address the concern, we will revise the abstract to briefly note the key assumptions and reference Section 4 for the complete analysis and proof sketch.
  Revision: partial
- Referee: [Abstract] Abstract (and Experiments section): claims of 'top level performance' and 'strong efficiency-capability trade-off' on embodied, long-horizon, and action-policy benchmarks are stated without metrics, baselines, error bars, dataset sizes, or statistical details, preventing assessment of whether superiority is demonstrated or if results reduce to unstated fitting.
  Authors: The abstract summarizes high-level outcomes; all requested details (specific metrics, baselines, error bars, dataset sizes, and statistical tests) appear in Section 5 with Tables 1–4 and Figures 3–6. We will revise the abstract to include one or two key quantitative results (e.g., relative gains and efficiency metrics) for improved clarity while preserving length constraints. No changes are required in the experiments section.
  Revision: yes
Circularity Check
No circularity: theoretical bounds asserted independently of inputs
Full rationale
The paper claims to establish formal theoretical bounds showing that the Hybrid Linear Temporal Attention factorization strictly limits error accumulation and guarantees state propagation. The provided text contains no equations, no derivation steps, no fitted parameters renamed as predictions, and no self-citations used to justify the bounds. Without any exhibited reduction of the claimed result to its own inputs by construction, the derivation chain is self-contained. This is the expected honest non-finding when no load-bearing circular step can be quoted.