Video models are zero-shot learners and reasoners
Pith reviewed 2026-05-14 02:11 UTC · model grok-4.3
The pith
Generative video models like Veo 3 exhibit zero-shot object segmentation, edge detection, physics understanding, affordance recognition, tool-use simulation, and early forms of visual reasoning such as maze and symmetry solving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Veo 3 solves a broad variety of tasks it wasn't explicitly trained for, including segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning such as maze and symmetry solving, indicating that video models are on a path to becoming unified, generalist vision foundation models.
What carries the argument
The generative video model Veo 3, whose web-scale video training produces emergent zero-shot perception, modeling, and manipulation of visual scenes.
If this is right
- Many task-specific vision models could be replaced by a single video model for segmentation, editing, and basic reasoning.
- Further scaling of video models should produce stronger visual reasoning without new task-specific training.
- Video models could serve as the core for unified systems that both generate and understand visual worlds.
- Physical-world interaction skills such as tool-use simulation become available without separate robotics training.
Where Pith is reading between the lines
- If the pattern holds, video models may reduce the need for separate datasets and architectures for each vision sub-task.
- The same emergence might appear in other large generative models trained on image or 3D data.
- Robotics and simulation environments could directly query video models for planning and affordance checks.
- Benchmarks that test novel physical reasoning in video sequences would provide clearer tests of these claims.
Load-bearing premise
The shown capabilities are performed in a genuinely zero-shot way with no task information hidden in prompts, no data contamination, and no post-hoc selection of successful cases.
What would settle it
Run Veo 3 on a fresh set of tasks with no possible overlap in common video training data, using fixed neutral prompts that give no hints, and compare success rates against random guessing or non-video baselines.
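A minimal sketch of that settling protocol, assuming hypothetical `generate_video` and `task_succeeded` hooks in place of the real Veo 3 API and per-task graders; the one-sided binomial test against chance is the comparison the proposal pins down.

```python
# Sketch of the proposed settling protocol, assuming hypothetical
# `generate_video` (model API) and `task_succeeded` (per-task grader) hooks.
# Each task uses one fixed, neutral prompt; the observed success count is
# tested against a chance baseline with a one-sided binomial test.
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def evaluate_task(task_inputs, fixed_prompt, chance, generate_video, task_succeeded):
    """Score one task over fresh inputs; report rate and p-value vs. chance."""
    outcomes = [task_succeeded(x, generate_video(fixed_prompt, x)) for x in task_inputs]
    n, k = len(outcomes), sum(outcomes)
    return {"success_rate": k / n, "p_vs_chance": binomial_p_value(k, n, chance)}
```

Running the same harness over a non-video baseline gives the second comparison the protocol asks for; everything else is a stand-in.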
Original abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the generative video model Veo 3 exhibits emergent zero-shot capabilities across diverse vision tasks—including object segmentation, edge detection, image editing, physical property understanding, affordance recognition, tool-use simulation, and early visual reasoning such as maze navigation and symmetry solving—without explicit task-specific training, suggesting that large-scale video generative pretraining can yield generalist vision foundation models analogous to LLMs.
Significance. If substantiated with quantitative controls, the result would indicate that video-scale generative pretraining can induce broad perceptual and reasoning abilities from web-scale data alone, potentially shifting vision modeling toward unified foundation models and opening avenues for zero-shot visual agents.
Major comments (3)
- [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples, with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias (a reporting sketch follows this list).
- [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified (a prompt-ablation sketch follows the minor comments).
- [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.
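To make the first comment concrete, here is a hedged sketch of the kind of aggregate reporting it asks for: per-task success counts summarized with 95% Wilson score intervals. The task names and counts below are placeholders, not numbers from the paper.

```python
# Placeholder illustration of aggregate reporting: per-task success rates
# with 95% Wilson score intervals. The counts are invented for demonstration
# and are not results from the paper.
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half


placeholder_counts = {"segmentation": (37, 50), "maze_solving": (12, 50)}
for task, (successes, n) in placeholder_counts.items():
    lo, hi = wilson_interval(successes, n)
    print(f"{task}: {successes / n:.2f} (95% CI [{lo:.2f}, {hi:.2f}], n={n})")
```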
Minor comments (2)
- [Figures] Figure captions should explicitly state the exact conditioning text used for each demonstration to allow reproducibility assessment.
- [Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing risks of data contamination and prompt sensitivity.
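The prompt-phrasing ablation flagged in the second major comment could take the following shape: the same inputs scored under several paraphrases of the task prompt, reporting the spread of per-prompt success rates. A minimal sketch, assuming hypothetical `generate_video` and `grade` hooks and illustrative paraphrases rather than the paper's actual prompts.

```python
# Sketch of a prompt-phrasing ablation: one task, several paraphrased prompts,
# and the spread of per-prompt success rates. `generate_video` and `grade`
# are hypothetical hooks; the paraphrases are illustrative, not the paper's.
from statistics import mean, pstdev

SEGMENTATION_PARAPHRASES = [
    "Highlight every distinct object in the scene with its own flat color.",
    "Color each separate object differently.",
    "Segment the scene: assign one color per object.",
]


def prompt_ablation(inputs, paraphrases, generate_video, grade):
    """Return per-prompt success rates plus their mean and spread."""
    rates = [
        mean(grade(x, generate_video(p, x)) for x in inputs)  # grade returns 0/1
        for p in paraphrases
    ]
    return {"per_prompt": rates, "mean": mean(rates), "spread": pstdev(rates)}
```

A small spread across paraphrases would support robustness; a large one would suggest the prompt carries implicit task specification.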
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical claims. We address each major comment below. Where feasible, we have revised the manuscript to add clarifications, additional examples, and discussions of limitations; however, some aspects are constrained by the proprietary nature of Veo 3.
Point-by-point responses
- Referee: [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias.
Authors: We acknowledge the value of quantitative metrics for assessing generality. Our initial demonstrations follow the qualitative style of early LLM papers that first illustrated emergent abilities before standardized benchmarks existed. In the revision, we have expanded the results to include a wider variety of examples (including documented failure cases) and added a dedicated limitations paragraph discussing selection bias and the absence of a fixed task suite. We note that developing aggregate metrics and error bars would require a new standardized benchmark, which we flag as future work rather than claiming current results are exhaustive. revision: partial
- Referee: [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified.
Authors: We have added a prompt-phrasing ablation in the supplementary material, testing multiple rewordings for several tasks to demonstrate robustness. Baseline comparisons to smaller open video models have been included where direct equivalents exist. However, a full decontamination protocol is not possible without access to Veo 3's proprietary training corpus; we have added an explicit limitations discussion acknowledging this constraint and arguing that the demonstrated tasks involve novel compositions unlikely to appear verbatim in web-scale data. revision: partial
- Referee: [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.
Authors: We agree that quantitative support strengthens the reasoning claim. The revised manuscript now includes a small-scale quantitative evaluation: repeated trials on varied mazes with reported success rates across difficulty levels (a verifier sketch follows these responses). We have also added explicit controls for prompt leakage by documenting all prompts used, testing paraphrased variants, and including these results in the main text. These changes provide measurable evidence beyond single curated examples. revision: yes
- Unresolved: a complete decontamination protocol against Veo 3's proprietary training data cannot be performed without access to the closed training corpus.
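A minimal sketch of the maze evaluation described in the third response, under assumed conventions: a maze encoded as a character grid ('#' wall, '.' corridor, 'S' start, 'G' goal) and a hypothetical `solve_maze_with_veo` hook that extracts a coordinate path from the generated video. The verifier is exact even though the model interface is invented.

```python
# Maze-evaluation sketch: each trial checks a model-proposed path against the
# maze grid, and success rates are tallied per difficulty level. The grid
# encoding and `solve_maze_with_veo` hook are assumptions, not the paper's.
from collections import defaultdict


def path_is_valid(maze: list[str], path: list[tuple[int, int]]) -> bool:
    """True iff `path` steps one cell at a time from 'S' to 'G' avoiding '#'."""
    if len(path) < 2:
        return False
    rows, cols = len(maze), len(maze[0])
    in_bounds_open = all(
        0 <= r < rows and 0 <= c < cols and maze[r][c] != "#" for r, c in path
    )
    adjacent = all(
        abs(r1 - r2) + abs(c1 - c2) == 1 for (r1, c1), (r2, c2) in zip(path, path[1:])
    )
    return (
        in_bounds_open
        and adjacent
        and maze[path[0][0]][path[0][1]] == "S"
        and maze[path[-1][0]][path[-1][1]] == "G"
    )


def success_rates_by_difficulty(trials, solve_maze_with_veo):
    """trials: iterable of (difficulty, maze) pairs; returns rate per level."""
    tally = defaultdict(lambda: [0, 0])  # difficulty -> [successes, attempts]
    for difficulty, maze in trials:
        path = solve_maze_with_veo(maze)  # hypothetical model hook
        tally[difficulty][0] += int(path is not None and path_is_valid(maze, path))
        tally[difficulty][1] += 1
    return {d: s / n for d, (s, n) in tally.items()}
```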
Circularity Check
No circularity: purely empirical demonstrations without derivations or self-referential reductions
Full rationale
The paper contains no mathematical derivations, equations, fitted parameters, or predictive claims that reduce to inputs by construction. It consists solely of qualitative visual examples showing Veo 3 performing tasks such as segmentation, edge detection, affordance recognition, and maze solving. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises that would create circularity. The central claims rest on selected demonstrations rather than any chain that equates outputs to prior fitted quantities or self-defined relations, so the work is self-contained with respect to the defined circularity criteria.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large generative models trained on web-scale data develop emergent capabilities beyond their training objective.
Forward citations
Cited by 23 Pith papers
- ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes. ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...
- Progressive Photorealistic Simplification. Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.
- Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs. LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
- CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models. CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
- Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency. Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.
- Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency. Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.
- Grokking of Diffusion Models: Case Study on Modular Addition. Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models. GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
- WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors. The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
- Do multimodal models imagine electric sheep? Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
- Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency. Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.
- Open-Source Image Editing Models Are Zero-Shot Vision Learners. Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
- Image Generators are Generalist Vision Learners. Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
- How Far Are Video Models from True Multimodal Reasoning? Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation. A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
- VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning. VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...
- Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories. A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
- VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis. VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas. Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- Motif-Video 2B: Technical Report. Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
- Neural Computers. Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models. OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation. This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges...
Reference graph
Works this paper leans on
- [1] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
- [2] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [3] Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, et al. Weaver: Foundation models for creative writing. arXiv preprint arXiv:2401.17268, 2024.
- [4] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675, 2023.
- [5] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.
- [6] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM agents as research assistants. arXiv preprint arXiv:2501.04227, 2025.
- [7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [8] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- [9] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- [10] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [12] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [13] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- [14] Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
- [15] Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, and Paolo Favaro. From generation to generalization: Emergent few-shot learning in video diffusion models. arXiv preprint arXiv:2506.07280, 2025.
- [16] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
- [17] Yijing Lin, Mengqi Huang, Shuhan Zhuang, and Zhendong Mao. RealGeneral: Unifying visual generation via temporal in-context learning with video models. arXiv preprint arXiv:2503.10406, 2025.
- [18] Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. VisualCloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960, 2025.
- [19] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
- [20] Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, and Bernt Schiele. Test-time visual in-context tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19996–20005, 2025.
- [21] Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. PixWizard: Versatile image-to-image visual assistant with open-language instructions. arXiv preprint arXiv:2409.15278, 2024.
- [22] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.
- [23] Duong H Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, and Jiasen Lu. One diffusion to generate them all. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2671–2682, 2025.
- [24] Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023.
- [25] Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, and Jitendra Malik. Scaling properties of diffusion models for perceptual tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12945–12954, 2025.
- [26] Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schuurmans. Video as the new language for real-world decision making. arXiv preprint arXiv:2402.17139, 2024.
- [27] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [28] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.
- [29] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
- [30] Google Cloud. Vertex AI Veo prompt rewriter. https://cloud.google.com/vertex-ai/generative-ai/docs/video/turn-the-prompt-rewriter-off#prompt-rewriter. Accessed: September 22, 2025.
- [32] LMSYS Org. Text-to-video leaderboard. https://lmarena.ai/leaderboard/text-to-video, September 2025. Accessed: September 23, 2025.
- [33] Google. Veo 2 announcement. https://blog.google/technology/google-labs/video-image-generation-update-december-2024/, 2024. Accessed: September 22, 2025.
- [34] Google. Veo 2 launch. https://developers.googleblog.com/en/veo-2-video-generation-now-generally-available/, 2025. Accessed: September 22, 2025.
- [35] Google. Veo 3 announcement. https://blog.google/technology/ai/generative-media-models-io-2025/, 2025. Accessed: September 22, 2025.
- [36] Google. Veo 3 launch. https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai, 2025. Accessed: September 22, 2025.
- [37] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
- [38] Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning. arXiv preprint arXiv:1803.07616, 2018.
- [39] Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2021.
- [40] Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, and Aniruddha Kembhavi. Benchmarking progress to infant-level physical reasoning in AI. Transactions on Machine Learning Research, 2022.
- [41] Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023.
- [42] Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Judith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36, 2024.
- [43] Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. VideoPhy: Evaluating physical commonsense for video generation, 2024.
- [44] Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024.
- [45] Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024.
- [46] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024.
- [47] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? arXiv preprint arXiv:2501.09038, 2025.
- [48] Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu. Generative physical AI in vision: A survey. arXiv preprint arXiv:2501.10928, 2025.
- [49] Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, pages 1–11, 2025.
- [50] Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025.
- [51] Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025.
- [52] Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual Jenga: Discovering object dependencies via counterfactual inpainting. arXiv preprint arXiv:2503.21770, 2025.
- [53] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- [54] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.
- [55] Google. Nano Banana: Gemini image generation overview. https://gemini.google/overview/image-generation/, 2025. Accessed: September 22, 2025.
- [56] Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. Advances in Neural Information Processing Systems, 36:58921–58937, 2023.
- [57] Priyank Jaini, Kevin Clark, and Robert Geirhos. Intriguing properties of generative classifiers. In The Twelfth International Conference on Learning Representations, 2023.
- [58] Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to image diffusion models are zero-shot segmentors. arXiv preprint arXiv:2211.13224, 2022.
- [59] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
- [60] Xavier Soria, Edgar Riba, and Angel Sappa. Dense extreme inception network: Towards a robust CNN model for edge detection. In The IEEE Winter Conference on Applications of Computer Vision (WACV '20), 2020.
- [61] Xavier Soria, Angel Sappa, Patricio Humanante, and Arash Akbarinia. Dense extreme inception network for edge detection. Pattern Recognition, 139:109461, 2023. doi: 10.1016/j.patcog.2023.109461.
- [62] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- [63] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.
- [64] Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, and Dacheng Tao. Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111, 2024.
- [65] Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: Instructional editing and reasoning of video concepts with grounded generation. arXiv preprint arXiv:2503.14350, 2025.
- [66] Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaid, and Ron Kimmel. Pathways on the image manifold: Image editing via video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7857–7866, 2025.
- [67] Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. KiVA: Kid-inspired visual analogies for testing large multimodal models. arXiv preprint arXiv:2407.17773, 2024.
- [68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
- [69] Andrew Kyle Lampinen, Stephanie CY Chan, Aaditya K Singh, and Murray Shanahan. The broader spectrum of in-context learning. arXiv preprint arXiv:2412.03782, 2024.
- [70] Chaz Firestone. Performance vs. competence in human–machine comparisons. Proceedings of the National Academy of Sciences, 117(43):26562–26571, 2020.
- [71] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
- [72] Ben Cottier, Ben Snodin, David Owen, and Tom Adamczewski. LLM inference prices have fallen rapidly but unequally across tasks, March 2025. URL: https://epoch.ai/data-insights/llm-inference-price-trends. Accessed: September 12, 2025.
- [73] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- [74] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
- [75] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [76] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- [77] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [78] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
- [79] Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing, 30:2072–2086, 2021.
- [80] Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems, 37:113436–113460, 2024.