
arxiv: 2511.04570 · v2 · submitted 2025-11-06 · 💻 cs.CV · cs.CL

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Pith reviewed 2026-05-18 00:51 UTC · model grok-4.3

keywords multimodal reasoning · video generation · unified models · VideoThinkBench · vision-language models · reasoning paradigms · Sora

The pith

Video generation models can unify multimodal reasoning by using generated video sequences as a single medium for both visual dynamics and textual logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a new paradigm called Thinking with Video that treats video generation models such as Sora-2 as reasoners capable of handling both vision-centric and text-centric problems. It introduces the VideoThinkBench benchmark to evaluate this approach on tasks ranging from eyeballing puzzles to MATH and MMMU. The reported evaluations show Sora-2 performing comparably to state-of-the-art vision-language models on vision-centric tasks, exceeding GPT-5 by 10 percent on eyeballing puzzles, and scoring 92 percent on MATH and 69.2 percent on MMMU. The work argues that video overcomes two limitations of earlier paradigms, the static nature of images and the separation of text and vision into distinct modalities, and so represents continuous change more naturally. A reader would care because this points toward simpler, more integrated architectures for future multimodal systems.

Core claim

The paper claims that video generation models such as Sora-2 function as capable multimodal reasoners on VideoThinkBench, achieving 92 percent accuracy on MATH, 69.2 percent on MMMU, and surpassing GPT-5 by 10 percent on eyeballing puzzles, which supports the positioning of Thinking with Video as a potential unified multimodal understanding and generation paradigm.

What carries the argument

The Thinking with Video paradigm carries the argument: sequences of generated video frames serve as a unified medium that represents both dynamic processes and multimodal information during reasoning.

If this is right

  • Separate vision-language and language models could be replaced by a single video generation backbone for tasks involving continuous change.
  • Self-consistency and in-context learning techniques can be applied directly to improve video-based reasoning outputs.
  • Benchmarks focused on dynamic processes would become more central than static image or text-only evaluations.
  • Unified models would naturally support both understanding queries and generating explanatory video sequences.
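The self-consistency point above can be made concrete with a short sketch. It assumes, hypothetically, that several videos have been generated for the same question and that one answer string (or nothing) has already been extracted from each; the function name `majority_vote` and the normalization rule are illustrative choices, not the paper's implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent extracted answer, or None if none was extracted.

    Hypothetical helper: each element is the answer read from one generated
    video, or None/empty when extraction failed for that sample.
    """
    # Light normalization so that " 42 " and "42" count as the same vote.
    cleaned = [a.strip().lower() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    # most_common(1) returns the top (answer, count) pair; ties resolve to
    # the answer that was seen first.
    return Counter(cleaned).most_common(1)[0][0]


# Illustrative inputs only: five samples for one question, one failed extraction.
samples = ["42", " 42 ", "41", None, "42"]
print(majority_vote(samples))
```

Voting over extracted answers rather than over raw videos keeps the aggregation step independent of how the answer is read off the frames or audio.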

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This paradigm could extend to domains requiring real-time simulation of physical processes, such as planning in robotics.
  • Future work might test whether video generation models maintain performance when forced to output non-visual reasoning traces.
  • Integration with existing tools could allow video models to interleave frame generation with symbolic computation steps.

Load-bearing premise

That strong results on the new VideoThinkBench tasks reflect genuine reasoning rather than the model's skill at producing plausible video sequences drawn from training patterns.

What would settle it

A clear drop in Sora-2 performance on novel reasoning problems that require tracking unseen temporal relationships or logical steps not present in its training distribution.
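One way to make "a clear drop" operational is to compare accuracy on the original problems with accuracy on novel variants and check whether the gap is distinguishable from noise. The sketch below is an assumption-laden illustration, not a procedure from the paper: it uses a normal-approximation 95% interval for the difference of two proportions, and the counts in the usage lines are placeholders rather than reported results.

```python
import math
from typing import Tuple


def accuracy_drop(correct_a: int, n_a: int,
                  correct_b: int, n_b: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Return (drop, lower, upper) for accuracy_a - accuracy_b.

    Set A is the original problem set and set B the novel variants.
    Uses the normal approximation for a difference of two proportions;
    z = 1.96 corresponds to a 95% interval.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    drop = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return drop, drop - z * se, drop + z * se


# Placeholder counts, NOT results from the paper.
drop, lower, upper = accuracy_drop(45, 50, 30, 50)
print(f"drop={drop:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

If the lower bound stays above zero, the drop is unlikely to be sampling noise at this level; with small sample sizes an exact or bootstrap interval would be a safer choice than this approximation.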

Original abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that the video generation model is the potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes 'Thinking with Video' as a new multimodal reasoning paradigm that uses video generation models such as Sora-2 to produce video sequences whose frames encode step-by-step solutions. It introduces the VideoThinkBench benchmark spanning vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., MATH, MMMU, GSM8K). The authors report that Sora-2 reaches 92% accuracy on MATH, 69.2% on MMMU, and exceeds GPT-5 by 10% on eyeballing puzzles, concluding that video generation models are promising unified multimodal understanding and generation systems.

Significance. If the performance genuinely reflects video-mediated reasoning rather than pattern retrieval, the work could open a new direction for unifying generation and understanding in dynamic visual sequences, extending beyond static images or text chains. The introduction of VideoThinkBench and the reported gains on established benchmarks would be of interest to the multimodal reasoning community.

major comments (3)
  1. [Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.
  2. [Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.
  3. [VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.
minor comments (2)
  1. [Abstract] Clarify the exact pipeline for converting generated video frames into final answers (e.g., frame sampling, OCR, or downstream VLM extraction).
  2. [Analysis] Add a brief discussion of how self-consistency and in-context learning are implemented for the video generation model.
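The first minor comment asks how frames become answers. As a rough illustration of the frame-sampling step only, the sketch below picks evenly spaced frame indices and always includes the last frame, where a final answer is most likely to be visible. The function name `sample_frame_indices` and the sampling rule are assumptions for illustration; the OCR or VLM extraction that would consume these frames is deliberately left out.

```python
from typing import List


def sample_frame_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices, always including the last frame.

    Hypothetical helper: the returned frames would then be passed to some
    downstream extractor (OCR or a VLM), which is not modeled here.
    """
    if num_frames <= 0 or k <= 0:
        return []
    if k == 1 or num_frames == 1:
        # With a single sample, prefer the final frame.
        return [num_frames - 1]
    k = min(k, num_frames)
    step = (num_frames - 1) / (k - 1)
    # Deduplicate in case rounding maps two positions to the same index.
    return sorted({round(i * step) for i in range(k)})


print(sample_frame_indices(10, 4))
```

Sampling by index keeps this step independent of the video decoder, so the same indices can be reused with whichever frame reader and extractor a pipeline actually employs.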

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where additional clarity and evidence would strengthen our claims about video generation models as a reasoning paradigm. We address each major comment point by point below, indicating revisions where the manuscript will be updated.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.

    Authors: We agree that the abstract, due to length constraints, omits these details and that this limits immediate verifiability. The full manuscript describes the evaluation protocol, frame-based answer extraction, and task construction in the VideoThinkBench and Experiments sections. We will revise the abstract to incorporate a brief summary of the evaluation protocol, answer extraction method, and key task details, along with a note on statistical controls. revision: yes

  2. Referee: [Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.

    Authors: This concern is valid and central to the interpretation of our results. The manuscript includes a systematic analysis of ability sources and qualitative examples of step-by-step video derivations. However, we did not present explicit ablations for prompt leakage or side-by-side comparisons to canonical visual explanations. We will add these controls and ablations in the revised Experiments section to better rule out pure pattern completion. revision: yes

  3. Referee: [VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.

    Authors: We acknowledge that explicit discussion of overlap and retrieval controls is missing. The benchmark was constructed with novel or modified tasks, particularly for vision-centric puzzles, to minimize direct matches to common video content. We will revise the VideoThinkBench section to include a dedicated discussion of potential overlaps with internet video data and report additional controls, such as performance on held-out problem variants, to support the inference interpretation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly introduced benchmark with external model.

Full rationale

The paper proposes the 'Thinking with Video' paradigm conceptually, introduces VideoThinkBench as a new evaluation suite covering vision- and text-centric tasks, and reports direct performance numbers for Sora-2 (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) plus comparisons to VLMs and GPT-5. No equations, fitted parameters, or derivations appear; the central claim rests on benchmark results rather than any self-definitional loop, renamed prediction, or load-bearing self-citation chain. The evaluation chain is self-contained against external models and tasks, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on interpreting video generation outputs as reasoning traces and on the new benchmark accurately capturing multimodal reasoning, with no explicit free parameters or invented entities stated.

axioms (1)
  • Domain assumption: video frames can serve as a unified medium that inherently captures dynamic reasoning processes without modality-specific training.
    Invoked when claiming video generation enables unified understanding and generation beyond static images or text.

pith-pipeline@v0.9.0 · 5861 in / 1220 out tokens · 54039 ms · 2026-05-18T00:51:24.114476+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Video Models Can Reason with Verifiable Rewards

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...

  2. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  3. Kling-Omni Technical Report

    cs.CV 2025-12 unverdicted novelty 6.0

    Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.
