pith. machine review for the scientific record.

arxiv: 2509.20328 · v2 · submitted 2025-09-24 · 💻 cs.LG · cs.AI · cs.CV · cs.RO

Recognition: 1 theorem link

Video models are zero-shot learners and reasoners

Been Kim, Kevin Swersky, Nick Matarese, Paul Vicol, Priyank Jaini, Robert Geirhos, Shixiang Shane Gu, Thaddäus Wiedemer, Yuxuan Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 02:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.RO
keywords video models · zero-shot learning · visual reasoning · generative models · foundation models · emergent capabilities

The pith

Generative video models like Veo 3 perform zero-shot object segmentation, edge detection, physics understanding, affordance recognition, tool simulation, and early visual reasoning such as maze and symmetry solving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that simple large generative video models trained on web-scale data can handle a wide range of vision tasks they were never explicitly trained for. These include segmenting objects, detecting edges, editing images, understanding physical properties, recognizing how objects afford actions, and simulating tool use. The same models also support basic visual reasoning like navigating mazes or identifying symmetries. A sympathetic reader would care because this pattern mirrors how large language models unified many language tasks through scaling alone, suggesting video models may follow the same route to general-purpose vision understanding.

Core claim

Veo 3 solves a broad variety of tasks it wasn't explicitly trained for, including segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more; these abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving, indicating that video models are on a path to becoming unified, generalist vision foundation models.

What carries the argument

The generative video model Veo 3, which uses its web-scale video training to produce emergent zero-shot perception, modeling, and manipulation of visual scenes.

If this is right

  • Many task-specific vision models could be replaced by a single video model for segmentation, editing, and basic reasoning.
  • Further scaling of video models should produce stronger visual reasoning without new task-specific training.
  • Video models could serve as the core for unified systems that both generate and understand visual worlds.
  • Physical-world interaction skills such as tool-use simulation become available without separate robotics training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the pattern holds, video models may reduce the need for separate datasets and architectures for each vision sub-task.
  • The same emergence might appear in other large generative models trained on image or 3D data.
  • Robotics and simulation environments could directly query video models for planning and affordance checks.
  • Benchmarks that test novel physical reasoning in video sequences would provide clearer tests of these claims.

Load-bearing premise

The shown capabilities are performed in a genuinely zero-shot way with no task information hidden in prompts, no data contamination, and no post-hoc selection of successful cases.

What would settle it

Run Veo 3 on a fresh set of tasks with no possible overlap in common video training data, using fixed neutral prompts that give no hints, and compare success rates against random guessing or non-video baselines.
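That settling protocol is, at bottom, a binomial comparison against a chance baseline. A minimal sketch of the scoring side in Python; the `model` callable, the task set, the prompt, and the chance level are hypothetical placeholders, not anything from the paper:

```python
import math
import random

def p_value_vs_chance(successes: int, trials: int, chance: float) -> float:
    """One-sided binomial p-value: P(X >= successes) if the model only guessed."""
    return sum(
        math.comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def evaluate(model, tasks, prompt: str, chance: float) -> dict:
    """Score a model on fresh tasks with a single fixed, neutral prompt."""
    successes = sum(bool(model(task, prompt)) for task in tasks)
    return {
        "success_rate": successes / len(tasks),
        "p_vs_chance": p_value_vs_chance(successes, len(tasks), chance),
    }

# Stand-in for a video model on binary tasks; a real run would call the
# generator and an automatic verifier of its output video.
random.seed(0)
mock_model = lambda task, prompt: random.random() < 0.8

report = evaluate(mock_model, tasks=range(50), prompt="Show the solution path.", chance=0.5)
```

Reporting the exact p-value alongside the raw success rate turns "better than guessing" into a checkable claim rather than an impression.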

read the original abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the generative video model Veo 3 exhibits emergent zero-shot capabilities across diverse vision tasks—including object segmentation, edge detection, image editing, physical property understanding, affordance recognition, tool-use simulation, and early visual reasoning such as maze navigation and symmetry solving—without explicit task-specific training, suggesting that large-scale video generative pretraining can yield generalist vision foundation models analogous to LLMs.

Significance. If substantiated with quantitative controls, the result would indicate that video-scale generative pretraining can induce broad perceptual and reasoning abilities from web-scale data alone, potentially shifting vision modeling toward unified foundation models and opening avenues for zero-shot visual agents.

major comments (3)
  1. [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias.
  2. [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified.
  3. [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.
minor comments (2)
  1. [Figures] Figure captions should explicitly state the exact conditioning text used for each demonstration to allow reproducibility assessment.
  2. [Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing risks of data contamination and prompt sensitivity.
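On the first major comment: once pass/fail counts exist for a fixed task suite, error bars are cheap to add. A hedged sketch of the standard Wilson score interval for a reported success rate; the 14-of-20 count is invented for illustration:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (centre - half, centre + half)

# e.g. 14 of 20 demonstrations reproduced on re-run (invented numbers)
lo, hi = wilson_interval(14, 20)
```

Even at 20 trials the interval is wide (roughly 0.48 to 0.85 for 14/20), which is exactly why aggregate counts matter more than curated single examples.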

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical claims. We address each major comment below. Where feasible, we have revised the manuscript to add clarifications, additional examples, and discussions of limitations; however, some aspects are constrained by the proprietary nature of Veo 3.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results demonstrations: the central zero-shot claim rests entirely on curated qualitative examples with no reported aggregate success rates, error bars, or fixed task-suite metrics, making it impossible to assess generality or rule out selection bias.

    Authors: We acknowledge the value of quantitative metrics for assessing generality. Our initial demonstrations follow the qualitative style of early LLM papers that first illustrated emergent abilities before standardized benchmarks existed. In the revision, we have expanded the results to include a wider variety of examples (including documented failure cases) and added a dedicated limitations paragraph discussing selection bias and the absence of a fixed task suite. We note that developing aggregate metrics and error bars would require a new standardized benchmark, which we flag as future work rather than claiming current results are exhaustive. revision: partial

  2. Referee: [Results] Results section on task demonstrations: no ablation of prompt phrasing, no decontamination protocol for test videos against training data, and no baseline comparisons are provided, so the interpretation that capabilities arise purely from generative pretraining rather than implicit task specification cannot be verified.

    Authors: We have added a prompt-phrasing ablation in the supplementary material, testing multiple rewordings for several tasks to demonstrate robustness. Baseline comparisons to smaller open video models have been included where direct equivalents exist. However, a full decontamination protocol is not possible without access to Veo 3's proprietary training corpus; we have added an explicit limitations discussion acknowledging this constraint and arguing that the demonstrated tasks involve novel compositions unlikely to appear verbatim in web-scale data. revision: partial

  3. Referee: [Visual Reasoning] Section on visual reasoning examples (maze and symmetry): without quantitative evaluation or controls for prompt leakage, the claim that these constitute emergent reasoning remains unsupported by the presented evidence.

    Authors: We agree that quantitative support strengthens the reasoning claim. The revised manuscript now includes a small-scale quantitative evaluation: repeated trials on varied mazes with reported success rates across difficulty levels. We have also added explicit controls for prompt leakage by documenting all prompts used, testing paraphrased variants, and including these results in the main text. These changes provide measurable evidence beyond single curated examples. revision: yes
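The aggregation the revised evaluation describes can be sketched in a few lines; the run data below is invented, not taken from the paper:

```python
from collections import defaultdict

def success_by_difficulty(runs):
    """Collapse repeated (difficulty, solved) maze runs into per-level success rates."""
    counts = defaultdict(lambda: [0, 0])  # difficulty -> [successes, attempts]
    for difficulty, solved in runs:
        counts[difficulty][0] += int(solved)
        counts[difficulty][1] += 1
    return {level: s / n for level, (s, n) in sorted(counts.items())}

# Invented example runs: three attempts per difficulty level
runs = [
    ("1-easy", True), ("1-easy", True), ("1-easy", False),
    ("2-hard", False), ("2-hard", True), ("2-hard", False),
]
rates = success_by_difficulty(runs)  # {"1-easy": 2/3, "2-hard": 1/3}
```

Sorting by difficulty label makes any degradation with maze complexity visible at a glance, which is the pattern a reasoning claim would need to survive.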

standing simulated objections not resolved
  • A complete decontamination protocol against Veo 3's proprietary training data cannot be performed without access to the closed training corpus.
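Even without the closed corpus, evaluation media can at least be screened against indexed public sources. A deliberately minimal sketch: exact-match frame signatures catch only verbatim reuse, not re-encoded or cropped near-duplicates, which is part of why the objection stands:

```python
import hashlib

def frame_signature(frame_bytes: bytes) -> str:
    """Exact-match signature for one decoded frame; misses re-encoded copies."""
    return hashlib.sha256(frame_bytes).hexdigest()

def contamination_rate(eval_frames, corpus_signatures) -> float:
    """Fraction of evaluation frames whose signature appears in a public-corpus index."""
    hits = sum(frame_signature(f) in corpus_signatures for f in eval_frames)
    return hits / len(eval_frames)

# Hypothetical byte strings standing in for decoded frames
corpus_index = {frame_signature(b"frame-a"), frame_signature(b"frame-b")}
rate = contamination_rate([b"frame-a", b"frame-c"], corpus_index)  # 0.5
```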

Circularity Check

0 steps flagged

No circularity: purely empirical demonstrations without derivations or self-referential reductions

full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or predictive claims that reduce to inputs by construction. It consists solely of qualitative visual examples showing Veo 3 performing tasks such as segmentation, edge detection, affordance recognition, and maze solving. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises that would create circularity. The central claims rest on selected demonstrations rather than any chain that equates outputs to prior fitted quantities or self-defined relations, rendering the work self-contained against the defined circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions about model scaling and emergent capabilities from large generative training. No free parameters are fitted in the reported results. No new entities are postulated.

axioms (1)
  • domain assumption: Large generative models trained on web-scale data develop emergent capabilities beyond their training objective.
    Invoked in the introduction to frame the video model results as analogous to LLMs.

pith-pipeline@v0.9.0 · 5486 in / 1190 out tokens · 30133 ms · 2026-05-14T02:11:09.410472+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 8.0

    ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...

  2. Progressive Photorealistic Simplification

    cs.CV 2026-05 unverdicted novelty 7.0

    Progressive semantic image simplification uses VLMs and a verifier to iteratively remove and inpaint scene elements while preserving photorealism, distilled into an image-to-video model for direct sequence prediction.

  3. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  4. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  5. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.

  6. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 7.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.

  7. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  8. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  9. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  10. Do multimodal models imagine electric sheep?

    cs.CV 2026-05 conditional novelty 6.0

    Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

  11. Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

    cs.CV 2026-05 unverdicted novelty 6.0

    Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency yields faster training and more coherent diffusion-based image animation than first-frame reference methods.

  12. Open-Source Image Editing Models Are Zero-Shot Vision Learners

    cs.CV 2026-05 unverdicted novelty 6.0

    Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

  13. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  14. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  15. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  16. VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeFlow performs versatile video chroma-lux editing in zero-shot fashion by self-supervised disentanglement of structure and color-illumination cues inside pre-trained video models, plus residual velocity fields and ...

  17. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  18. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  19. Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

    cs.CV 2026-03 unverdicted novelty 6.0

    Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.

  20. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  21. Neural Computers

    cs.LG 2026-04 unverdicted novelty 5.0

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

  22. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  23. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 21 Pith papers · 14 internal anchors

  1. [1]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.arXiv preprint arXiv:2406.00515, 2024

  2. [2]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  3. [3]

    Weaver: Foundation models for creative writing

    Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, et al. Weaver: Foundation models for creative writing. arXiv preprint arXiv:2401.17268, 2024

  4. [4]

    Multilingual machine translation with large language models: Empirical results and analysis

    Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis.arXiv preprint arXiv:2304.04675, 2023

  5. [5]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024

  6. [6]

    Agent Laboratory: Using LLM Agents as Research Assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.arXiv preprint arXiv:2501.04227, 2025

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  8. [8]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

  9. [9]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2022

  10. [10]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35: 22199–22213, 2022

  11. [11]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  12. [12]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  13. [13]

    You only look once: Unified, real-time object detection

    Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016

  14. [14]

    YOLOv11: An Overview of the Key Architectural Enhancements

    Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements.arXiv preprint arXiv:2410.17725, 2024

  15. [15]

    From generation to generalization: Emergent few-shot learning in video diffusion models

    Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, and Paolo Favaro. From generation to generalization: Emergent few-shot learning in video diffusion models.arXiv preprint arXiv:2506.07280, 2025

  16. [16]

    Taskonomy: Disentangling task transfer learning

    Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3712–3722, 2018

  17. [17]

    Realgeneral: Unifying visual generation via temporal in-context learning with video models

    Yijing Lin, Mengqi Huang, Shuhan Zhuang, and Zhendong Mao. Realgeneral: Unifying visual generation via temporal in-context learning with video models.arXiv preprint arXiv:2503.10406, 2025

  18. [18]

    Visualcloze: A universal image generation framework via visual in-context learning

    Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, and Ming-Ming Cheng. Visualcloze: A universal image generation framework via visual in-context learning. arXiv preprint arXiv:2504.07960, 2025

  19. [19]

    Images speak in images: A generalist painter for in-context visual learning

    Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023

  20. [20]

    Test-time visual in-context tuning

    Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, and Bernt Schiele. Test-time visual in-context tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19996–20005, 2025

  21. [21]

    Pixwizard: Versatile image-to-image visual assistant with open-language instructions

    Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024

  22. [22]

    Omnigen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025

  23. [23]

    One diffusion to generate them all

    Duong H Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, and Jiasen Lu. One diffusion to generate them all. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2671–2682, 2025

  24. [24]

    Dreamix: Video diffusion models are general video editors

    Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, and Yedid Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023

  25. [25]

    Scaling properties of diffusion models for perceptual tasks

    Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, and Jitendra Malik. Scaling properties of diffusion models for perceptual tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12945–12954, 2025

  26. [26]

    Video as the new language for real-world decision making

    Sherry Yang, Jacob Walker, Jack Parker-Holder, Yilun Du, Jake Bruce, Andre Barreto, Pieter Abbeel, and Dale Schuurmans. Video as the new language for real-world decision making.arXiv preprint arXiv:2402.17139, 2024

  27. [27]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  28. [28]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InThe eleventh international conference on learning representations, 2022

  29. [29]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys, 55(9):1–35, 2023

  30. [30]

    Vertex AI Veo Prompt Rewriter

    Google Cloud. Vertex AI Veo Prompt Rewriter. https://cloud.google.com/vertex-ai/generative-ai/docs/video/turn-the-prompt-rewriter-off#prompt-rewriter. Accessed: September 22, 2025

  32. [32]

    LMSYS Org text-to-video leaderboard

    LMSYS ORG. LMSYS Org text-to-video leaderboard. https://lmarena.ai/leaderboard/text-to-video, September 2025. Accessed: 2025-09-23

  33. [33]

    Veo 2 announcement

    Google. Veo 2 announcement. https://blog.google/technology/google-labs/video-image-generation-update-december-2024/, 2024. Accessed: September 22, 2025

  34. [34]

    Veo 2 launch

    Google. Veo 2 launch. https://developers.googleblog.com/en/veo-2-video-generation-now-generally-available/, 2025. Accessed: September 22, 2025

  35. [35]

    Veo 3 announcement

    Google. Veo 3 announcement. https://blog.google/technology/ai/generative-media-models-io-2025/, 2025. Accessed: September 22, 2025

  36. [36]

    Veo 3 launch

    Google. Veo 3 launch. https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai, 2025. Accessed: September 22, 2025

  37. [37]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. InProceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015

  38. [38]

    IntPhys: A framework and benchmark for visual intuitive physics reasoning

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. IntPhys: A framework and benchmark for visual intuitive physics reasoning.arXiv preprint arXiv:1803.07616, 2018

  39. [39]

    Physion: Evaluating physical prediction from vision in humans and machines

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2021

  40. [40]

    Benchmarking progress to infant-level physical reasoning in ai

    Luca Weihs, Amanda Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, and Aniruddha Kembhavi. Benchmarking progress to infant-level physical reasoning in ai. Transactions on Machine Learning Research, 2022

  41. [41]

    Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023

  42. [42]

    Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Judith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36, 2024

  43. [43]

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. VideoPhy: Evaluating physical commonsense for video generation, 2024

  44. [44]

    Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027, 2024

  45. [45]

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation, 2024

  46. [46]

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024

  47. [47]

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? arXiv preprint arXiv:2501.09038, 2025

  48. [48]

    Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, and Chang Xu. Generative physical AI in vision: A survey. arXiv preprint arXiv:2501.10928, 2025

  49. [49]

    Luca M Schulze Buschoff, Elif Akata, Matthias Bethge, and Eric Schulz. Visual cognition in multimodal large language models. Nature Machine Intelligence, pages 1–11, 2025

  50. [50]

    Chenyu Zhang, Daniil Cherniavskii, Andrii Zadaianchuk, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025

  51. [51]

    Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025

  52. [52]

    Anand Bhattad, Konpat Preechakul, and Alexei A Efros. Visual jenga: Discovering object dependencies via counterfactual inpainting. arXiv preprint arXiv:2503.21770, 2025

  53. [53]

    Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015

  54. [54]

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  55. [55]

    Google. Nano Banana: Gemini Image Generation Overview. https://gemini.google/overview/image-generation/, 2025. Accessed: September 22, 2025

  56. [56]

    Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. Advances in Neural Information Processing Systems, 36:58921–58937, 2023

  57. [57]

    Priyank Jaini, Kevin Clark, and Robert Geirhos. Intriguing properties of generative classifiers. In The Twelfth International Conference on Learning Representations, 2023

  58. [58]

    Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to image diffusion models are zero-shot segmentors. arXiv preprint arXiv:2211.13224, 2022

  59. [59]

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

  60. [60]

    Xavier Soria, Edgar Riba, and Angel Sappa. Dense extreme inception network: Towards a robust CNN model for edge detection. In The IEEE Winter Conference on Applications of Computer Vision (WACV ’20), 2020

  61. [61]

    Xavier Soria, Angel Sappa, Patricio Humanante, and Arash Akbarinia. Dense extreme inception network for edge detection. Pattern Recognition, 139:109461, 2023. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2023.109461. URL https://www.sciencedirect.com/science/article/pii/S0031320323001619

  62. [62]

    Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

  63. [63]

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  64. [64]

    Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, and Dacheng Tao. Diffusion model-based video editing: A survey. arXiv preprint arXiv:2407.07111, 2024

  65. [65]

    Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, and Mohit Bansal. VEGGIE: Instructional editing and reasoning of video concepts with grounded generation. arXiv preprint arXiv:2503.14350, 2025

  66. [66]

    Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaid, and Ron Kimmel. Pathways on the image manifold: Image editing via video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7857–7866, 2025

  67. [67]

    Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, and Kate Saenko. KiVA: Kid-inspired visual analogies for testing large multimodal models. arXiv preprint arXiv:2407.17773, 2024

  68. [68]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012

  69. [69]

    Andrew Kyle Lampinen, Stephanie CY Chan, Aaditya K Singh, and Murray Shanahan. The broader spectrum of in-context learning. arXiv preprint arXiv:2412.03782, 2024

  70. [70]

    Chaz Firestone. Performance vs. competence in human–machine comparisons. Proceedings of the National Academy of Sciences, 117(43):26562–26571, 2020

  71. [71]

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

  72. [72]

    Ben Cottier, Ben Snodin, David Owen, and Tom Adamczewski. LLM inference prices have fallen rapidly but unequally across tasks, March 2025. URL https://epoch.ai/data-insights/llm-inference-price-trends. Accessed: 2025-09-12

  73. [73]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  74. [74]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  75. [75]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

  76. [76]

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

  77. [77]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  78. [78]

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023

  79. [79]

    Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing, 30:2072–2086, 2021

  80. [80]

    Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems, 37:113436–113460, 2024

Showing first 80 references.