Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Hangcheng Li; Jingqi Tong; Jun Zhao; Ming Zhang; Mingzhe Li; Qiguang Chen; Tianyi Liang; Xiaomeng Hu; Xinchi Chen; Xipeng Qiu

arxiv: 2511.04570 · v2 · submitted 2025-11-06 · 💻 cs.CV · cs.CL

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

Pith reviewed 2026-05-18 00:51 UTC · model grok-4.3

keywords multimodal reasoning · video generation · unified models · VideoThinkBench · vision-language models · reasoning paradigms · Sora

The pith

Video generation models can unify multimodal reasoning by using generated video sequences as a single medium for both visual dynamics and textual logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a new paradigm called Thinking with Video that treats video generation models such as Sora-2 as reasoners capable of handling both vision-centric and text-centric problems. It introduces the VideoThinkBench benchmark to evaluate this approach on tasks ranging from eyeballing puzzles to MATH and MMMU. Evaluations show that Sora-2 is comparable to state-of-the-art vision-language models on vision-centric tasks, exceeds GPT-5 by 10 percent on eyeballing puzzles, and scores 92 percent on MATH and 69.2 percent on MMMU. The work argues that video overcomes the static nature of images and the separation of text and vision into distinct modalities, enabling a more natural representation of continuous change. A reader would care because this points toward simpler, more integrated architectures for future multimodal systems.

Core claim

The paper claims that video generation models such as Sora-2 function as capable multimodal reasoners on VideoThinkBench, achieving 92 percent accuracy on MATH, 69.2 percent on MMMU, and surpassing GPT-5 by 10 percent on eyeballing puzzles, which supports the positioning of Thinking with Video as a potential unified multimodal understanding and generation paradigm.

What carries the argument

The Thinking with Video paradigm, which employs sequences of generated video frames as a unified medium to represent dynamic processes and multimodal information for reasoning.

If this is right

Separate vision-language and language models could be replaced by a single video generation backbone for tasks involving continuous change.
Self-consistency and in-context learning techniques can be applied directly to improve video-based reasoning outputs.
Benchmarks focused on dynamic processes would become more central than static image or text-only evaluations.
Unified models would naturally support both understanding queries and generating explanatory video sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This paradigm could extend to domains requiring real-time simulation of physical processes, such as planning in robotics.
Future work might test whether video generation models maintain performance when forced to output non-visual reasoning traces.
Integration with existing tools could allow video models to interleave frame generation with symbolic computation steps.

Load-bearing premise

That strong results on the new VideoThinkBench tasks reflect genuine reasoning rather than the model's skill at producing plausible video sequences drawn from training patterns.

What would settle it

A clear drop in Sora-2 performance on novel reasoning problems that require tracking unseen temporal relationships or logical steps not present in its training distribution.

Original abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that the video generation model is the potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper frames video generation as a unified reasoning medium with a new benchmark, but the high accuracy claims rest on thin evaluation details that do not yet rule out pattern retrieval.

The main point is that this work tries to position video generation models like Sora-2 as a single architecture for multimodal reasoning. The authors argue that video can handle dynamic processes better than static images or text alone, and they back the idea with VideoThinkBench, which mixes vision puzzles and text problems such as MATH and MMMU. Sora-2 reportedly reaches 92% on MATH and beats GPT-5 by 10% on eyeballing puzzles, with some gains from self-consistency and in-context learning.

Referee Report

3 major / 2 minor

Summary. The paper proposes 'Thinking with Video' as a new multimodal reasoning paradigm that uses video generation models such as Sora-2 to produce video sequences whose frames encode step-by-step solutions. It introduces the VideoThinkBench benchmark spanning vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., MATH, MMMU, GSM8K). The authors report that Sora-2 reaches 92% accuracy on MATH, 69.2% on MMMU, and exceeds GPT-5 by 10% on eyeballing puzzles, concluding that video generation models are promising unified multimodal understanding and generation systems.

Significance. If the performance genuinely reflects video-mediated reasoning rather than pattern retrieval, the work could open a new direction for unifying generation and understanding in dynamic visual sequences, extending beyond static images or text chains. The introduction of VideoThinkBench and the reported gains on established benchmarks would be of interest to the multimodal reasoning community.

major comments (3)

[Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.
[Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.
[VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.

minor comments (2)

[Abstract] Clarify the exact pipeline for converting generated video frames into final answers (e.g., frame sampling, OCR, or downstream VLM extraction).
[Analysis] Add a brief discussion of how self-consistency and in-context learning are implemented for the video generation model.
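One way to picture the self-consistency step the referee asks about is a majority vote over answers extracted from several independently generated videos. The sketch below is illustrative only: it assumes some upstream step has already turned each video into a short answer string (or `None` when extraction failed), and the names `normalize` and `majority_vote` are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def normalize(answer: str) -> str:
    """Canonicalize an extracted answer so trivially different forms vote together."""
    text = answer.strip().lower().rstrip(".")
    try:
        # Treat "4" and "4.0" as the same numeric answer.
        return format(float(text), "g")
    except ValueError:
        return text


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent normalized answer, ignoring failed extractions (None)."""
    votes = Counter(normalize(a) for a in answers if a is not None)
    if not votes:
        return None
    # On a tie, the answer that appeared first among the samples wins.
    return votes.most_common(1)[0][0]
```

Under this sketch, a problem would be scored as correct when the voted answer matches the reference answer; how the paper actually aggregates samples may differ.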

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where additional clarity and evidence would strengthen our claims about video generation models as a reasoning paradigm. We address each major comment point by point below, indicating revisions where the manuscript will be updated.

Referee: [Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.

Authors: We agree that the abstract, due to length constraints, omits these details and that this limits immediate verifiability. The full manuscript describes the evaluation protocol, frame-based answer extraction, and task construction in the VideoThinkBench and Experiments sections. We will revise the abstract to incorporate a brief summary of the evaluation protocol, answer extraction method, and key task details, along with a note on statistical controls.
revision: yes

Referee: [Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.

Authors: This concern is valid and central to the interpretation of our results. The manuscript includes a systematic analysis of ability sources and qualitative examples of step-by-step video derivations. However, we did not present explicit ablations for prompt leakage or side-by-side comparisons to canonical visual explanations. We will add these controls and ablations in the revised Experiments section to better rule out pure pattern completion.
revision: yes

Referee: [VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.

Authors: We acknowledge that explicit discussion of overlap and retrieval controls is missing. The benchmark was constructed with novel or modified tasks, particularly for vision-centric puzzles, to minimize direct matches to common video content. We will revise the VideoThinkBench section to include a dedicated discussion of potential overlaps with internet video data and report additional controls, such as performance on held-out problem variants, to support the inference interpretation.
revision: yes
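
A held-out-variant control of the kind promised in the last response could be summarized with a two-proportion z-test comparing accuracy on original problems against accuracy on modified variants. This is a minimal sketch under stated assumptions: the function name `two_proportion_z` is hypothetical, the test treats the two problem sets as independent samples, and nothing here reflects counts or procedures reported in the paper.

```python
import math
from typing import Tuple


def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the null hypothesis that both accuracies are equal."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis of no difference.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both sets are entirely correct or entirely wrong: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A large positive z with a small p-value on original versus variant sets would be consistent with the retrieval interpretation; a negligible gap would favor the inference interpretation, subject to the sample sizes involved.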

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly introduced benchmark with external model.

The paper proposes the 'Thinking with Video' paradigm conceptually, introduces VideoThinkBench as a new evaluation suite covering vision- and text-centric tasks, and reports direct performance numbers for Sora-2 (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) plus comparisons to VLMs and GPT-5. No equations, fitted parameters, or derivations appear; the central claim rests on benchmark results rather than any self-definitional loop, renamed prediction, or load-bearing self-citation chain. The evaluation chain is self-contained against external models and tasks, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on interpreting video generation outputs as reasoning traces and on the new benchmark accurately capturing multimodal reasoning, with no explicit free parameters or invented entities stated.

axioms (1)

Domain assumption: Video frames can serve as a unified medium that inherently captures dynamic reasoning processes without modality-specific training.
Invoked when claiming video generation enables unified understanding and generation beyond static images or text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose 'Thinking with Video', a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Sora-2 achieves 92% accuracy on MATH and 69.2% on MMMU"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Video Models Can Reason with Verifiable Rewards
cs.CV · 2026-05 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
cs.CV · 2026-04 · unverdicted · novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

Kling-Omni Technical Report
cs.CV · 2025-12 · unverdicted · novelty 6.0

Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.


work page
[67]

Process Unnecessary:The problem itself does not require a written process to solve. C.4.2 Examples Figure 14 illustrates examples for four of the five categories: C.5 Manual Evaluation of ARC-AGI-2 To provide a more fine-grained assessment of Sora-2’s performance on ARC-AGI-2 beyond binary correctness, we manually evaluated 100 randomly selected samples a...

work page

[1] [1]

System card: Claude opus 4 & claude sonnet 4

Anthropic. System card: Claude opus 4 & claude sonnet 4. Technical report, Anthropic, May 2025. URL https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf

[2] [2]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. URL https://arxiv.org/abs/2311.15127

[3] [3]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025. URL https://arxiv.org/abs/2503.09567

[4] [4]

PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns

Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, and Soujanya Poria. Puzzlevqa: Diagnosing multimodal reasoning challenges of language models with abstract visual patterns. arXiv preprint arXiv:2403.13315, 2024

[5] [5]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025. URL https://arxiv.org/abs/2505.11831

[6] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

[7] [8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. URL https://arxiv.org/abs/2507.06261

[8] [9]

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, and Xinlong Wang. Emu3.5: Native multimodal models are world learners, 2025. URLh...

[9] [10]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[10] [11]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. URL https://arxiv.org/abs/2505.14683

[11] [12]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2024. Updated version v6, October 2024

[12] [13]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025

[13] [14]

Veo 3

Google. Veo 3. https://aistudio.google.com/models/veo-3, 2025. Accessed on November 7, 2025

[14] [15]

Gemini 2.5 flash & 2.5 flash image model card

Google DeepMind. Gemini 2.5 flash & 2.5 flash image model card. Technical report, Google DeepMind, August 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf. Last updated: August 27, 2025

[16] [17]

Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark, 2025. URL https://arxiv.org/abs/2510.26802

[17] [18]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[18] [19]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874

[19] [20]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang,...

[20] [21]

Enhancing advanced visual reasoning ability of large language models

Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, and Weidong Cai. Enhancing advanced visual reasoning ability of large language models, 2024. URL https://arxiv.org/abs/2409.13980

[21] [22]

Mmbench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

[22] [23]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

[23] [24]

GPT-4o System Card

OpenAI. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

[24] [25]

Gpt-5 system card

OpenAI. Gpt-5 system card. Technical report, OpenAI, August 2025. URL https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf

[25] [26]

Learning to reason with llms

OpenAI. Learning to reason with llms, 2025. URL https://openai.com/zh-Hans-CN/index/learning-to-reason-with-llms/. Accessed: 2025

[26] [27]

OpenAI o3 and o4-mini System Card

OpenAI. OpenAI o3 and o4-mini System Card. Technical report, OpenAI, April 2025. URL https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf. Accessed: 2025-11-01

[27] [28]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.

[28] [29]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

[29] [30]

Introducing Gen-3 Alpha: A New Frontier for Video Generation

Runway Research. Introducing Gen-3 Alpha: A New Frontier for Video Generation. https://runwayml.com/research/introducing-gen-3-alpha, June 2024. Accessed on November 7, 2025

[30] [31]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022. URL https://arxiv.org/abs/2210.09261

[31] [32]

Code2logic: Game-code-driven data synthesis for enhancing vlms general reasoning

Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, and Xuanjing Huang. Game-rl: Synthesizing multimodal verifiable game data to boost vlms’ general r...

[32] [34]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[33] [35]

Measuring multimodal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

[34] [36]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171

[35] [37]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

[36] [38]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[37] [39]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

[38] [40]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[39] [41]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. URL https://arxiv.org/abs/2502.03373

[40] [42]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

[41] [43]

Rewatch-r1: Boosting complex video reasoning in large vision-language models through agentic data synthesis

Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, and Bo Zheng. Rewatch-r1: Boosting complex video reasoning in large vision-language models through agentic data synthesis, 2025. URL https://arxiv.org/abs/2509.23652

[43] [45]

Vitcot: Video-text interleaved chain-of-thought for boosting video understanding in large language models

Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, and Libo Qin. Vitcot: Video-text interleaved chain-of-thought for boosting video understanding in large language models, 2025. URL https://arxiv.org/abs/2507.09876

[44] [46]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models, 2024. URL https://arxiv.org/abs/2302.00923

[45] [47]

Exploring the compositional deficiency of large language models in mathematical reasoning

Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, and Xuanjing Huang. Exploring the compositional deficiency of large language models in mathematical reasoning. arXiv preprint arXiv:2405.06680, 2024

[46] [48]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.

Appendix

Appendix Contents: A VideoThinkBench Sample Distribution ...

First, determine the visible answer from the image using this priority:
- If there is an explicit statement indicating the answer (e.g., “The answer is ...”), use that answer.
- Else, check for an answer marked by a symbol such as box, circle, underline, arrow, etc. If multiple positions are marked but show different results, respond ’no’ immediately.
- E...

Compare the visible answer in the image with the provided correct answer

Be strict but reasonable - minor formatting differences are acceptable if the core answer is correct

For multiple choice questions, check if the correct option (A, B, C, etc.) is clearly marked or highlighted

For numerical answers, check if the number matches (ignore minor formatting like “4” vs “4.0”)

For text answers, check if the key content matches (ignore case sensitivity and minor punctuation)

You must respond with ONLY ’yes’ or ’no’, nothing else

User instruction prompt:
Question: {question}
Correct answer: {correct_answer}
Does the image show the correct answer?
(The last frame of the generated video is also provided for the model.)

Prompt for Evaluating the Answer from the Audio

System prompt:
You are an expert answer checker for educational...

Your task is to determine if an audio transcript from a solution video contains the correct answer to a given question

Check if the transcript explicitly states or clearly implies the correct answer

Be lenient with phrasing - the transcript may explain the answer in different words

For multiple choice questions, check if the correct option (A, B, C, etc.) is mentioned

For numerical answers, check if the number is stated (ignore surrounding explanation)

For text answers, check if the key concept is explained correctly

Common phrases like “the correct answer is...”, “the answer is...”, “it should be...” indicate the answer

You must respond with ONLY ’yes’ or ’no’, nothing else

User instruction prompt:
Question: {question}
Correct answer: {correct_answer}
Audio transcript: {transcript}
Does the transcript provide the correct answer?

B.4.2 Human Alignment Check for Evaluation

We performed a human alignment check on a sample of 173 responses across the text-centric tasks to va...
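As a rough illustration of how the audio-evaluation user prompt could be assembled and how a strict yes/no reply could be read back, here is a minimal Python sketch. Only the template wording and the three placeholders (`{question}`, `{correct_answer}`, `{transcript}`) come from the prompt text; the names `AUDIO_USER_TEMPLATE`, `build_audio_user_prompt`, and `parse_verdict` are hypothetical, and the call to the judge model is deliberately left out because the source does not specify it.

```python
# Illustrative sketch only: helper names are assumptions, not the paper's code.
# The template wording follows the user instruction prompt quoted above.
AUDIO_USER_TEMPLATE = (
    "Question: {question}\n"
    "Correct answer: {correct_answer}\n"
    "Audio transcript: {transcript}\n"
    "Does the transcript provide the correct answer?"
)


def build_audio_user_prompt(question: str, correct_answer: str, transcript: str) -> str:
    """Fill the three placeholders of the user instruction prompt."""
    return AUDIO_USER_TEMPLATE.format(
        question=question,
        correct_answer=correct_answer,
        transcript=transcript,
    )


def parse_verdict(reply: str) -> bool:
    """Map a judge reply to True ('yes') or False ('no'); reject anything else."""
    # Tolerate surrounding whitespace, quotes, and a trailing period.
    cleaned = reply.strip().strip("'\"’‘.").lower()
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    raise ValueError(f"judge reply is not a bare yes/no: {reply!r}")
```

Raising on anything other than a bare yes or no mirrors the prompt's "ONLY ’yes’ or ’no’" requirement; a pipeline might instead retry or count such replies as incorrect, which is a design choice the source does not settle.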

Completely Correct: The solution has a clear and correct process without any errors

Logic Correct with Writing Errors: The solution contains expressional mistakes, but the overall logic is identifiable and correct

Unreadable or Incorrect Logic: The writing is too disorganized or contains too many errors to discern the reasoning, or it exhibits clear logical mistakes or major omissions

Missing Solution Process: Necessary steps are absent; apart from the final answer, the response is blank or contains only meaningless scribbles (i.e., lines, circles, etc)

Process Unnecessary: The problem itself does not require a written process to solve.

C.4.2 Examples

Figure 14 illustrates examples for four of the five categories:

C.5 Manual Evaluation of ARC-AGI-2

To provide a more fine-grained assessment of Sora-2’s performance on ARC-AGI-2 beyond binary correctness, we manually evaluated 100 randomly selected samples a...