Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Pith reviewed 2026-05-18 00:51 UTC · model grok-4.3
The pith
Video generation models can unify multimodal reasoning by using generated video sequences as a single medium for both visual dynamics and textual logic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that video generation models such as Sora-2 function as capable multimodal reasoners on VideoThinkBench, achieving 92 percent accuracy on MATH, 69.2 percent on MMMU, and surpassing GPT-5 by 10 percent on eyeballing puzzles, which supports the positioning of Thinking with Video as a potential unified multimodal understanding and generation paradigm.
What carries the argument
The Thinking with Video paradigm, which employs sequences of generated video frames as a unified medium to represent dynamic processes and multimodal information for reasoning.
If this is right
- Separate vision-language and language models could be replaced by a single video generation backbone for tasks involving continuous change.
- Self-consistency and in-context learning techniques can be applied directly to improve video-based reasoning outputs.
- Benchmarks focused on dynamic processes would become more central than static image or text-only evaluations.
- Unified models would naturally support both understanding queries and generating explanatory video sequences.
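The self-consistency point can be illustrated with a minimal Python sketch. It assumes each sampled video has already been reduced to a final answer string by some extraction step that is not shown. The function name `majority_vote` and the sample answers are hypothetical, and this is not the paper's implementation.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if nothing was extracted.

    Answers are compared after trimming whitespace and lowercasing.
    Ties are broken in favor of the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None
    counts = Counter(normalized)
    best = max(counts.values())
    for answer in normalized:
        if counts[answer] == best:
            return answer
    return None


# Hypothetical answers extracted from five sampled videos for one problem;
# None marks a video from which no answer could be read.
samples = ["42", "42", None, "41", " 42 "]
print(majority_vote(samples))
```

Videos with no readable answer are dropped before voting here. Counting them as wrong answers instead is an equally defensible choice and would need to be stated in any reported protocol.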
Where Pith is reading between the lines
- This paradigm could extend to domains requiring real-time simulation of physical processes, such as planning in robotics.
- Future work might test whether video generation models maintain performance when forced to output non-visual reasoning traces.
- Integration with existing tools could allow video models to interleave frame generation with symbolic computation steps.
Load-bearing premise
That strong results on the new VideoThinkBench tasks reflect genuine reasoning rather than the model's skill at producing plausible video sequences drawn from training patterns.
What would settle it
A clear drop in Sora-2 performance on novel reasoning problems that require tracking unseen temporal relationships or logical steps not present in its training distribution.
Original abstract
The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that the video generation model is the potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes 'Thinking with Video' as a new multimodal reasoning paradigm that uses video generation models such as Sora-2 to produce video sequences whose frames encode step-by-step solutions. It introduces the VideoThinkBench benchmark spanning vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., MATH, MMMU, GSM8K). The authors report that Sora-2 reaches 92% accuracy on MATH, 69.2% on MMMU, and exceeds GPT-5 by 10% on eyeballing puzzles, concluding that video generation models are promising unified multimodal understanding and generation systems.
Significance. If the performance genuinely reflects video-mediated reasoning rather than pattern retrieval, the work could open a new direction for unifying generation and understanding in dynamic visual sequences, extending beyond static images or text chains. The introduction of VideoThinkBench and the reported gains on established benchmarks would be of interest to the multimodal reasoning community.
major comments (3)
- [Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.
- [Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.
- [VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.
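One statistical control of the kind requested above is a confidence interval on each reported accuracy. The sketch below computes a Wilson score interval in Python. The sample size of 100 is a hypothetical placeholder, since the number of evaluated problems is not stated in the text reviewed here.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical: 92 correct out of 100 problems (the true sample size is an assumption).
low, high = wilson_interval(92, 100)
print(f"accuracy 0.92, 95% interval [{low:.3f}, {high:.3f}]")
```

With a small evaluation subset the interval is wide, which is why the sample size matters for interpreting a headline figure such as 92%.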
minor comments (2)
- [Abstract] Clarify the exact pipeline for converting generated video frames into final answers (e.g., frame sampling, OCR, or downstream VLM extraction).
- [Analysis] Add a brief discussion of how self-consistency and in-context learning are implemented for the video generation model.
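The frame-to-answer pipeline raised in the first minor comment might look like the following Python sketch. It is only an illustration of one possible design: frames are sampled evenly, and a reader (OCR or a downstream VLM, represented here by a hypothetical `read_text` callable) is applied from the last frame backwards. None of these names come from the paper.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frame_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k evenly spaced frame indices, always including the last frame."""
    if num_frames <= 0 or k <= 0:
        return []
    if k == 1 or num_frames == 1:
        return [num_frames - 1]
    k = min(k, num_frames)
    step = (num_frames - 1) / (k - 1)
    return sorted({round(i * step) for i in range(k)})


def extract_answer(
    frames: Sequence[Frame],
    read_text: Callable[[Frame], Optional[str]],
    k: int = 4,
) -> Optional[str]:
    """Scan the sampled frames from last to first; return the first non-empty reading."""
    for idx in reversed(sample_frame_indices(len(frames), k)):
        text = read_text(frames[idx])
        if text and text.strip():
            return text.strip()
    return None


# Stand-in data: strings play the role of frames, and the reader returns them as-is.
fake_frames = ["", "", "x = 3", "", "", "", "", "", "", "The answer is 3"]
print(sample_frame_indices(len(fake_frames), 4))
print(extract_answer(fake_frames, lambda frame: frame or None))
```

Starting from the last frame reflects the assumption that a written solution ends with its final answer. A protocol that reads intermediate frames first would give different results on videos where the final frame is blank or corrupted.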
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which identifies key areas where additional clarity and evidence would strengthen our claims about video generation models as a reasoning paradigm. We address each major comment point by point below, indicating revisions where the manuscript will be updated.
-
Referee: [Abstract] Abstract: the reported accuracies (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) are presented without any description of the evaluation protocol, answer extraction procedure from generated frames, statistical controls, or task construction details, so it is impossible to verify that the numbers support the reasoning claims.
Authors: We agree that the abstract, due to length constraints, omits these details and that this limits immediate verifiability. The full manuscript describes the evaluation protocol, frame-based answer extraction, and task construction in the VideoThinkBench and Experiments sections. We will revise the abstract to incorporate a brief summary of the evaluation protocol, answer extraction method, and key task details, along with a note on statistical controls. revision: yes
-
Referee: [Experiments] The central claim that video generation constitutes a reasoning medium requires evidence that Sora-2 produces novel step-by-step derivations rather than high-probability visual sequences from pre-training. No ablation, control for prompt leakage, or comparison against canonical visual explanations of the same problems is described, leaving the performance numbers open to the alternative interpretation of pattern completion.
Authors: This concern is valid and central to the interpretation of our results. The manuscript includes a systematic analysis of ability sources and qualitative examples of step-by-step video derivations. However, we did not present explicit ablations for prompt leakage or side-by-side comparisons to canonical visual explanations. We will add these controls and ablations in the revised Experiments section to better rule out pure pattern completion. revision: yes
-
Referee: [VideoThinkBench] VideoThinkBench section: the benchmark description does not address potential overlap between the vision-centric and text-centric tasks and common internet video content, nor does it report controls that would distinguish genuine inference from retrieval of training patterns.
Authors: We acknowledge that explicit discussion of overlap and retrieval controls is missing. The benchmark was constructed with novel or modified tasks, particularly for vision-centric puzzles, to minimize direct matches to common video content. We will revise the VideoThinkBench section to include a dedicated discussion of potential overlaps with internet video data and report additional controls, such as performance on held-out problem variants, to support the inference interpretation. revision: yes
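The held-out-variant control mentioned in the last response could be summarized by comparing accuracy on original problems with accuracy on modified variants. The Python sketch below computes the gap and a pooled two-proportion z statistic. All counts are hypothetical, and the function name `accuracy_gap` is an illustration rather than anything described by the authors.

```python
import math
from typing import Tuple


def accuracy_gap(
    orig_correct: int, orig_total: int, held_correct: int, held_total: int
) -> Tuple[float, float]:
    """Return (accuracy difference, pooled two-proportion z statistic)."""
    if orig_total <= 0 or held_total <= 0:
        raise ValueError("totals must be positive")
    p_orig = orig_correct / orig_total
    p_held = held_correct / held_total
    pooled = (orig_correct + held_correct) / (orig_total + held_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / orig_total + 1 / held_total))
    z = (p_orig - p_held) / se if se > 0 else 0.0
    return p_orig - p_held, z


# Hypothetical counts: 92/100 on original problems, 75/100 on held-out variants.
gap, z = accuracy_gap(92, 100, 75, 100)
print(f"gap = {gap:.2f}, z = {z:.2f}")
```

A large positive gap would be consistent with the retrieval interpretation raised by the referee, while a gap near zero would favor the inference interpretation, subject to the variants being genuinely novel.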
Circularity Check
No circularity: empirical evaluation on newly introduced benchmark with external model.
The paper proposes the 'Thinking with Video' paradigm conceptually, introduces VideoThinkBench as a new evaluation suite covering vision- and text-centric tasks, and reports direct performance numbers for Sora-2 (92% MATH, 69.2% MMMU, +10% on eyeballing puzzles) plus comparisons to VLMs and GPT-5. No equations, fitted parameters, or derivations appear; the central claim rests on benchmark results rather than any self-definitional loop, renamed prediction, or load-bearing self-citation chain. The evaluation chain is self-contained against external models and tasks, with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Video frames can serve as a unified medium that inherently captures dynamic reasoning processes without modality-specific training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose 'Thinking with Video', a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning."
- IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Sora-2 achieves 92% accuracy on MATH and 69.2% on MMMU"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Video Models Can Reason with Verifiable Rewards
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...
-
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
-
Kling-Omni Technical Report
Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.
Appendix excerpts
Prompt for evaluating the answer from the image (rules):
1. First, determine the visible answer from the image using this priority:
   - If there is an explicit statement indicating the answer (e.g., “The answer is ...”), use that answer.
   - Else, check for an answer marked by a symbol such as box, circle, underline, arrow, etc. If multiple positions are marked but show different results, respond 'no' immediately.
   - E...
2. Compare the visible answer in the image with the provided correct answer.
3. Be strict but reasonable: minor formatting differences are acceptable if the core answer is correct.
4. For multiple choice questions, check if the correct option (A, B, C, etc.) is clearly marked or highlighted.
5. For numerical answers, check if the number matches (ignore minor formatting like “4” vs “4.0”).
6. For text answers, check if the key content matches (ignore case sensitivity and minor punctuation).
7. You must respond with ONLY 'yes' or 'no', nothing else.
User instruction prompt:
Question: {question}
Correct answer: {correct_answer}
Does the image show the correct answer?
(The last frame of the generated video is also provided for the model.)

Prompt for Evaluating the Answer from the Audio
System prompt: You are an expert answer checker for educational...
1. Check if the transcript explicitly states or clearly implies the correct answer.
2. Be lenient with phrasing: the transcript may explain the answer in different words.
3. For multiple choice questions, check if the correct option (A, B, C, etc.) is mentioned.
4. For numerical answers, check if the number is stated (ignore surrounding explanation).
5. For text answers, check if the key concept is explained correctly.
6. Common phrases like “the correct answer is...”, “the answer is...”, “it should be...” indicate the answer.
7. You must respond with ONLY 'yes' or 'no', nothing else.
User instruction prompt:
Question: {question}
Correct answer: {correct_answer}
Audio transcript: {transcript}
Does the transcript provide the correct answer?

B.4.2 Human Alignment Check for Evaluation
We performed a human alignment check on a sample of 173 responses across the text-centric tasks to va...
1. Completely Correct: The solution has a clear and correct process without any errors.
2. Logic Correct with Writing Errors: The solution contains expressional mistakes, but the overall logic is identifiable and correct.
3. Unreadable or Incorrect Logic: The writing is too disorganized or contains too many errors to discern the reasoning, or it exhibits clear logical mistakes or major omissions.
4. Missing Solution Process: Necessary steps are absent; apart from the final answer, the response is blank or contains only meaningless scribbles (i.e., lines, circles, etc.).
5. Process Unnecessary: The problem itself does not require a written process to solve.

C.4.2 Examples
Figure 14 illustrates examples for four of the five categories:

C.5 Manual Evaluation of ARC-AGI-2
To provide a more fine-grained assessment of Sora-2’s performance on ARC-AGI-2 beyond binary correctness, we manually evaluated 100 randomly selected samples a...
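The image-based judging step described by the prompts above can be sketched in Python as follows. The user-prompt wording is taken from the excerpt, but the line breaks in the template, the function names, and the strict parsing policy are assumptions made for illustration. The call to the judging model itself is deliberately left out.

```python
from typing import Optional

# Wording from the excerpted user instruction prompt; line breaks are an assumption.
IMAGE_USER_TEMPLATE = (
    "Question: {question}\n"
    "Correct answer: {correct_answer}\n"
    "Does the image show the correct answer?"
)


def build_user_prompt(question: str, correct_answer: str) -> str:
    """Fill the user instruction template for one benchmark item."""
    return IMAGE_USER_TEMPLATE.format(question=question, correct_answer=correct_answer)


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a judge reply to True/False; return None for anything other than yes/no."""
    cleaned = reply.strip().strip("'\"’‘.").lower()
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None


# Hypothetical item and hypothetical judge replies.
print(build_user_prompt("What is 2 + 2?", "4"))
print(parse_verdict(" Yes. "), parse_verdict("'no'"), parse_verdict("probably"))
```

Returning `None` for off-format replies keeps them visible as a separate failure mode rather than silently counting them as incorrect.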