pith. machine review for the scientific record.

arxiv: 2509.20328 · v2 · submitted 2025-09-24 · 💻 cs.LG · cs.AI · cs.CV · cs.RO

Recognition: 1 theorem link

Video models are zero-shot learners and reasoners

Been Kim, Kevin Swersky, Nick Matarese, Paul Vicol, Priyank Jaini, Robert Geirhos, Shixiang Shane Gu, Thaddäus Wiedemer, Yuxuan Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 02:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.RO
keywords video models · zero-shot learning · visual reasoning · generative models · foundation models · emergent capabilities

The pith

Generative video models like Veo 3 perform zero-shot object segmentation, edge detection, physics understanding, affordance recognition, tool simulation, and early visual reasoning such as maze and symmetry solving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that simple large generative video models trained on web-scale data can handle a wide range of vision tasks they were never explicitly trained for. These include segmenting objects, detecting edges, editing images, understanding physical properties, recognizing how objects afford actions, and simulating tool use. The same models also support basic visual reasoning like navigating mazes or identifying symmetries. A sympathetic reader would care because this pattern mirrors how large language models unified many language tasks through scaling alone, suggesting video models may follow the same route to general-purpose vision understanding.

Core claim

Veo 3 solves a broad variety of tasks it wasn't explicitly trained for, including segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more; these abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving, indicating that video models are on a path to becoming unified, generalist vision foundation models.

What carries the argument

The generative video model Veo 3, which uses its web-scale video training to produce emergent zero-shot perception, modeling, and manipulation of visual scenes.

If this is right

  • Many task-specific vision models could be replaced by a single video model for segmentation, editing, and basic reasoning.
  • Further scaling of video models should produce stronger visual reasoning without new task-specific training.
  • Video models could serve as the core for unified systems that both generate and understand visual worlds.
  • Physical-world interaction skills such as tool-use simulation become available without separate robotics training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the pattern holds, video models may reduce the need for separate datasets and architectures for each vision sub-task.
  • The same emergence might appear in other large generative models trained on image or 3D data.
  • Robotics and simulation environments could directly query video models for planning and affordance checks.
  • Benchmarks that test novel physical reasoning in video sequences would provide clearer tests of these claims.

Load-bearing premise

The shown capabilities are performed in a genuinely zero-shot way with no task information hidden in prompts, no data contamination, and no post-hoc selection of successful cases.

What would settle it

Run Veo 3 on a fresh set of tasks with no possible overlap in common video training data, using fixed neutral prompts that give no hints, and compare success rates against random guessing or non-video baselines.
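That settling protocol is, at bottom, a binomial comparison against a chance baseline. A minimal sketch of the scoring side in Python; the `model` callable, the task set, the prompt, and the chance level are hypothetical placeholders, not anything from the paper:

```python
import math
import random

def p_value_vs_chance(successes: int, trials: int, chance: float) -> float:
    """One-sided binomial p-value: P(X >= successes) if the model only guessed."""
    return sum(
        math.comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def evaluate(model, tasks, prompt: str, chance: float) -> dict:
    """Score a model on fresh tasks with a single fixed, neutral prompt."""
    successes = sum(bool(model(task, prompt)) for task in tasks)
    return {
        "success_rate": successes / len(tasks),
        "p_vs_chance": p_value_vs_chance(successes, len(tasks), chance),
    }

# Stand-in for a video model on binary tasks; a real run would call the
# generator and an automatic verifier of its output video.
random.seed(0)
mock_model = lambda task, prompt: random.random() < 0.8

report = evaluate(mock_model, tasks=range(50), prompt="Show the solution path.", chance=0.5)
```

Reporting the exact p-value alongside the raw success rate turns "better than guessing" into a checkable claim rather than an impression.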

read the original abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the generative video model Veo 3 exhibits emergent zero-shot capabilities across diverse vision tasks—including object segmentation, edge detection, image editing, physical property understanding, affordance recognition, tool-use simulation, and early visual reasoning such as maze navigation and symmetry solving—without explicit task-specific training, suggesting that large-scale video generative pretraining can yield generalist vision foundation models analogous to LLMs.

Significance. If substantiated with quantitative controls, the result would indicate that video-scale generative pretraining can induce broad perceptual and reasoning abilities from web-scale data alone, potentially shifting vision modeling toward unified foundation models and opening avenues for zero-shot visual agents.

major comments (3)
  1. [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias.
  2. [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified.
  3. [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.
minor comments (2)
  1. [Figures] Figure captions should explicitly state the exact conditioning text used for each demonstration to allow reproducibility assessment.
  2. [Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing risks of data contamination and prompt sensitivity.
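On the first major comment: once pass/fail counts exist for a fixed task suite, error bars are cheap to add. A hedged sketch of the standard Wilson score interval for a reported success rate; the 14-of-20 count is invented for illustration:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (centre - half, centre + half)

# e.g. 14 of 20 demonstrations reproduced on re-run (invented numbers)
lo, hi = wilson_interval(14, 20)
```

Even at 20 trials the interval is wide (roughly 0.48 to 0.85 for 14/20), which is exactly why aggregate counts matter more than curated single examples.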

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical claims. We address each major comment below. Where feasible, we have revised the manuscript to add clarifications, additional examples, and discussions of limitations; however, some aspects are constrained by the proprietary nature of Veo 3.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias.

    Authors: We acknowledge the value of quantitative metrics for assessing generality. Our initial demonstrations follow the qualitative style of early LLM papers that first illustrated emergent abilities before standardized benchmarks existed. In the revision, we have expanded the results to include a wider variety of examples (including documented failure cases) and added a dedicated limitations paragraph discussing selection bias and the absence of a fixed task suite. We note that developing aggregate metrics and error bars would require a new standardized benchmark, which we flag as future work rather than claiming current results are exhaustive. revision: partial

  2. Referee: [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified.

    Authors: We have added a prompt-phrasing ablation in the supplementary material, testing multiple rewordings for several tasks to demonstrate robustness. Baseline comparisons to smaller open video models have been included where direct equivalents exist. However, a full decontamination protocol is not possible without access to Veo 3's proprietary training corpus; we have added an explicit limitations discussion acknowledging this constraint and arguing that the demonstrated tasks involve novel compositions unlikely to appear verbatim in web-scale data. revision: partial

  3. Referee: [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.

    Authors: We agree that quantitative support strengthens the reasoning claim. The revised manuscript now includes a small-scale quantitative evaluation: repeated trials on varied mazes with reported success rates across difficulty levels. We have also added explicit controls for prompt leakage by documenting all prompts used, testing paraphrased variants, and including these results in the main text. These changes provide measurable evidence beyond single curated examples. revision: yes
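The aggregation the revised evaluation describes can be sketched in a few lines; the run data below is invented, not taken from the paper:

```python
from collections import defaultdict

def success_by_difficulty(runs):
    """Collapse repeated (difficulty, solved) maze runs into per-level success rates."""
    counts = defaultdict(lambda: [0, 0])  # difficulty -> [successes, attempts]
    for difficulty, solved in runs:
        counts[difficulty][0] += int(solved)
        counts[difficulty][1] += 1
    return {level: s / n for level, (s, n) in sorted(counts.items())}

# Invented example runs: three attempts per difficulty level
runs = [
    ("1-easy", True), ("1-easy", True), ("1-easy", False),
    ("2-hard", False), ("2-hard", True), ("2-hard", False),
]
rates = success_by_difficulty(runs)  # {"1-easy": 2/3, "2-hard": 1/3}
```

Sorting by difficulty label makes any degradation with maze complexity visible at a glance, which is the pattern a reasoning claim would need to survive.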

standing simulated objections not resolved
  • A complete decontamination protocol against Veo 3's proprietary training data cannot be performed without access to the closed training corpus.
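Even without the closed corpus, evaluation media can at least be screened against indexed public sources. A deliberately minimal sketch: exact-match frame signatures catch only verbatim reuse, not re-encoded or cropped near-duplicates, which is part of why the objection stands:

```python
import hashlib

def frame_signature(frame_bytes: bytes) -> str:
    """Exact-match signature for one decoded frame; misses re-encoded copies."""
    return hashlib.sha256(frame_bytes).hexdigest()

def contamination_rate(eval_frames, corpus_signatures) -> float:
    """Fraction of evaluation frames whose signature appears in a public-corpus index."""
    hits = sum(frame_signature(f) in corpus_signatures for f in eval_frames)
    return hits / len(eval_frames)

# Hypothetical byte strings standing in for decoded frames
corpus_index = {frame_signature(b"frame-a"), frame_signature(b"frame-b")}
rate = contamination_rate([b"frame-a", b"frame-c"], corpus_index)  # 0.5
```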

Circularity Check

0 steps flagged

No circularity: purely empirical demonstrations without derivations or self-referential reductions

full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or predictive claims that reduce to inputs by construction. It consists solely of qualitative visual examples showing Veo 3 performing tasks such as segmentation, edge detection, affordance recognition, and maze solving. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises that would create circularity. The central claims rest on selected demonstrations rather than any chain that equates outputs to prior fitted quantities or self-defined relations, rendering the work self-contained against the defined circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions about model scaling and emergent capabilities from large generative training. No free parameters are fitted in the reported results. No new entities are postulated.

axioms (1)
  • domain assumption: Large generative models trained on web-scale data develop emergent capabilities beyond their training objective.
    Invoked in the introduction to frame the video model results as analogous to LLMs.

pith-pipeline@v0.9.0 · 5486 in / 1190 out tokens · 30133 ms · 2026-05-14T02:11:09.410472+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 8.0

    ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...

  2. Progressive Photorealistic Simplification

    cs.CV 2026-05 unverdicted novelty 7.0

    Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.

  3. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  4. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  5. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.

  6. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.

  7. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  8. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  9. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  10. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  11. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.

  12. Open-Source Image Editing Models Are Zero-Shot Vision Learners

    cs.CV 2026-05 unverdicted novelty 6.0

    Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

  13. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  14. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  15. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  16. VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...

  17. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  18. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  19. Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

    cs.CV 2026-03 unverdicted novelty 6.0

    Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.

  20. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  21. Neural Computers

    cs.LG 2026-04 unverdicted novelty 5.0

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

  22. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  23. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 21 Pith papers · 14 internal anchors

  1. [1]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.arXiv preprint arXiv:2406.00515, 2024

  2. [2]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  3. [3]

    Weaver: Foundation models for creative writing

    Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, et al. Weaver: Foundation models for creative writing. arXiv preprint arXiv:2401.17268, 2024

  4. [4]

    Multilingual machine translation with large language models: Empirical results and analysis

    Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis.arXiv preprint arXiv:2304.04675, 2023

  5. [5]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

  6. [6]

    Agent Laboratory: Using LLM Agents as Research Assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.arXiv preprint arXiv:2501.04227, 2025

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

  9. [9]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022

  10. [10]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35: 22199–22213, 2022

  11. [11]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  12. [12]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  13. [13]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

  14. [14]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements.arXiv preprint arXiv:2410.17725, 2024

  15. [15]

    From generation to generalization: Emergent few-shot learning in video diffusion models

    Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, and Paolo Favaro. From generation to generalization: Emergent few-shot learning in video diffusion models.arXiv preprint arXiv:2506.07280, 2025

  16. [16]

    Taskonomy: Disentangling task transfer learning

    Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018

  17. [17]

    Realgeneral: Unifying visual generation via temporal in-context learning with video models

    Yijing Lin, Mengqi Huang, Shuhan Zhuang, and Zhendong Mao. Realgeneral: Unifying visual generation via temporal in-context learning with video models.arXiv preprint arXiv:2503.10406, 2025

  18. [18]

    Visualcloze: A universal image generation framework via visual in-context learning

    Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960, 2025

  19. [19]

    Images speak in images: A generalist painter for in-context visual learning

    Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023

  20. [20]

    Test-time visual in-context tuning

    Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, and Bernt Schiele. Test-time visual in-context tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19996–20005, 2025

  21. [21]

    Pixwizard: Versatile image-to-image visual assistant with open-language instructions

    Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024

  22. [22]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025

  23. [23]

    One diffusion to generate them all

    Duong H Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, and Jiasen Lu. One diffusion to generate them all. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2671–2682, 2025

  24. [24]

    Dreamix: Video diffusion models are general video editors

    Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023

  25. [25]

    Scaling properties of diffusion models for perceptual tasks

    Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, and Jitendra Malik. Scaling properties of diffusion models for perceptual tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12945–12954, 2025

  26. [26]

    Video as the new language for real-world decision making

    Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schuurmans. Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  28. [28]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InThe eleventh international conference on learning representations, 2022

  29. [29]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023

  30. [30]

    Vertex AI Veo Prompt Rewriter

    Google Cloud. Vertex AI Veo Prompt Rewriter. https://cloud.google.com/vertex-ai/generative-ai/docs/video/turn-the-prompt-rewriter-off#prompt-rewriter. Accessed: September 22, 2025

  32. [32]

    LMSYS Org text-to-video leaderboard

    LMSYS ORG. LMSYS Org text-to-video leaderboard. https://lmarena.ai/leaderboard/text-to-video, September 2025. Accessed: 2025-09-23

  33. [33]

    Veo 2 announcement

    Google. Veo 2 announcement. https://blog.google/technology/google-labs/video-image-generation-update-december-2024/, 2024. Accessed: September 22, 2025

  34. [34]

    Veo 2 launch

    Google. Veo 2 launch. https://developers.googleblog.com/en/veo-2-video-generation-now-generally-available/, 2025. Accessed: September 22, 2025

  35. [35]

    Veo 3 announcement

    Google. Veo 3 announcement. https://blog.google/technology/ai/generative-media-models-io-2025/, 2025. Accessed: September 22, 2025

  36. [36]

    Veo 3 launch

    Google. Veo 3 launch. https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai, 2025. Accessed: September 22, 2025

  37. [37]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. InProceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015

  38. [38]

    IntPhys: A framework and benchmark for visual intuitive physics reasoning

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning.arXiv preprint arXiv:1803.07616, 2018

  39. [39]

    Physion: Evaluating physical prediction from vision in humans and machines

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2021

  40. [40]

    Benchmarking progress to infant-level physical reasoning in ai

    Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, and Aniruddha Kembhavi. Benchmarking progress to infant-level physical reasoning in ai. Transactions on Machine Learning Research, 2022

  41. [41]

    Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023

  42. [42]

    Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Judith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36, 2024

  43. [43]

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. VideoPhy: Evaluating physical commonsense for video generation, 2024

  44. [44]

    Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024

  45. [45]

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024

  46. [46]

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024

  47. [47]

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? arXiv preprint arXiv:2501.09038, 2025

  48. [48]

    Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu. Generative physical AI in vision: A survey. arXiv preprint arXiv:2501.10928, 2025

  49. [49]

    Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, pages 1–11, 2025

  50. [50]

    Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025

  51. [51]

    Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025

  52. [52]

    Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual jenga: Discovering object dependencies via counterfactual inpainting. arXiv preprint arXiv:2503.21770, 2025

  53. [53]

    Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015

  54. [54]

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  55. [55]

    Google. Nano Banana: Gemini Image Generation Overview. https://gemini.google/overview/image-generation/, 2025. Accessed: September 22, 2025

  56. [56]

    Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. Advances in Neural Information Processing Systems, 36:58921–58937, 2023

  57. [57]

    Priyank Jaini, Kevin Clark, and Robert Geirhos. Intriguing properties of generative classifiers. In The Twelfth International Conference on Learning Representations, 2023

  58. [58]

    Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to image diffusion models are zero-shot segmentors. arXiv preprint arXiv:2211.13224, 2022

  59. [59]

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

  60. [60]

    Xavier Soria, Edgar Riba, and Angel Sappa. Dense extreme inception network: Towards a robust CNN model for edge detection. In The IEEE Winter Conference on Applications of Computer Vision (WACV ’20), 2020

  61. [61]

    Xavier Soria, Angel Sappa, Patricio Humanante, and Arash Akbarinia. Dense extreme inception network for edge detection. Pattern Recognition, 139:109461, 2023. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2023.109461. URL https://www.sciencedirect.com/science/article/pii/S0031320323001619

  62. [62]

    Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

  63. [63]

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  64. [64]

    Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, and Dacheng Tao. Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111, 2024

  65. [65]

    Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: Instructional editing and reasoning of video concepts with grounded generation. arXiv preprint arXiv:2503.14350, 2025

  66. [66]

    Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaid, and Ron Kimmel. Pathways on the image manifold: Image editing via video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7857–7866, 2025

  67. [67]

    Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. KiVA: Kid-inspired visual analogies for testing large multimodal models. arXiv preprint arXiv:2407.17773, 2024

  68. [68]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012

  69. [69]

    Andrew Kyle Lampinen, Stephanie CY Chan, Aaditya K Singh, and Murray Shanahan. The broader spectrum of in-context learning. arXiv preprint arXiv:2412.03782, 2024

  70. [70]

    Chaz Firestone. Performance vs. competence in human–machine comparisons. Proceedings of the National Academy of Sciences, 117(43):26562–26571, 2020

  71. [71]

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

  72. [72]

    Ben Cottier, Ben Snodin, David Owen, and Tom Adamczewski. LLM inference prices have fallen rapidly but unequally across tasks, March 2025. URL https://epoch.ai/data-insights/llm-inference-price-trends. Accessed: 2025-09-12

  73. [73]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  74. [74]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  75. [75]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  76. [76]

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  77. [77]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  78. [78]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023

  79. [79]

    Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing, 30:2072–2086, 2021

  80. [80]

    Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems, 37:113436–113460, 2024

Showing first 80 references.