Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Dongjie Cheng; Heming Xia; Wenjie Li; Yongqi Li; Yutong Bian

arxiv: 2606.09585 · v1 · pith:5Z7TBQEAnew · submitted 2026-06-08 · 💻 cs.AI

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian , Dongjie Cheng , Heming Xia , Yongqi Li , Wenjie Li This is my paper

Pith reviewed 2026-06-27 16:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords optical reasoningvisual rationaleschain-of-thoughtmultimodal reasoningtoken efficiencytypographic reasoninggraphical reasoninginterleaved-modal reasoning

0 comments

The pith

Images alone can serve as a complete reasoning medium that matches or exceeds text chain-of-thought while cutting token counts substantially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes optical reasoning as a way to replace text-based chains with images that encode all reasoning steps for both language and multimodal problems. It implements this through two forms: typographic layouts that pack rationales compactly into images and graphical compositions that blend text with diagrams. Tests on mathematical, scientific, and interleaved-modal benchmarks show the visual approach performs at least as well as text while using fewer tokens. A sympathetic reader would care because the work suggests visual formats can compress and unify reasoning without sacrificing capability.

Core claim

Optical reasoning treats images as a standalone reasoning medium for both language and multimodal tasks. The two variants are typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, this approach matches or exceeds traditional text reasoning while reducing reasoning tokens by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning.

What carries the argument

Optical reasoning, the method of using generated images as the sole medium to encode and present reasoning steps, implemented via typographic and graphical variants.

If this is right

Optical reasoning matches or exceeds text-based reasoning on mathematical, scientific, and interleaved-modal benchmarks.
It reduces reasoning tokens by 28.57 percent on language tasks and 16 percent on multimodal tasks.
It delivers 1.96 times the token efficiency of text reasoning.
Images supply a unified visual canvas that encodes rationales both effectively and efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future models could be trained to output visual rationales directly instead of generating text first.
This approach might lower compute costs for long reasoning sequences by exploiting the density of visual formats.
Educational tools or scientific visualization systems could adopt visual step-by-step reasoning to improve clarity.
Spatial or diagrammatic tasks may show even larger gains when reasoning stays entirely in image form.

Load-bearing premise

Multimodal models can accurately read and reason over the generated images without losing information or adding hallucinations compared to the equivalent text chains.

What would settle it

A controlled test in which identical reasoning content is presented once as text and once as the paper's generated image, and the model produces more errors on the image version.

read the original abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Images as standalone reasoning medium is a clean new framing, but the efficiency and accuracy claims rest on an untested assumption that MLLMs read the visuals without losing content.

read the letter

The paper's main contribution is the proposal to drop text entirely from the reasoning chain and let images carry the full rationale for both pure language tasks and multimodal ones. They split the approach into typographic layouts that pack text densely into images and graphical versions that add diagrams and structure. That move goes past the usual interleaved text-plus-image setups in the cited CoT literature.

The reported outcome is that these visual rationales match or beat standard text reasoning on math, science, and multimodal benchmarks while cutting token count by roughly 29 percent on language tasks and 16 percent on multimodal ones, for an overall 1.96 times efficiency gain. If the numbers hold, the token saving is the practical hook.

The soft spot is exactly where the stress-test note points: nothing in the abstract shows that the MLLM extracts the same reasoning steps from the generated image as it would from the equivalent text chain. There is no fidelity metric, no ablation on layout errors or symbol misreads, and no error analysis. Without that check, any systematic information loss would simultaneously explain away both the accuracy parity and the token reduction. The abstract also gives no model sizes, dataset details, generation procedure for the visuals, or baseline comparisons, so the numbers cannot be evaluated yet.

The idea itself is straightforward and internally consistent on its own terms. It does not appear to invent new entities or hide circular derivations. The limitation is simply that the central empirical claim is not yet supported by the visible evidence.

This is the kind of paper a multimodal reasoning group should see, mainly to discuss whether the visual-fidelity gap can be closed. It deserves a serious referee pass so the authors can supply the missing ablations or the reviewers can flag the gap clearly. I would not cite it yet, but I would want to know what the full experiments actually show.

Referee Report

3 major / 2 minor

Summary. The paper proposes 'optical reasoning,' in which images alone (via typographic layouts or graphical compositions) serve as the complete reasoning medium for both language-only and multimodal tasks, replacing text-based chain-of-thought. It reports that this approach matches or exceeds standard text reasoning on mathematical, scientific, and interleaved-modal benchmarks while cutting reasoning tokens by 28.57% (language) and 16% (multimodal) on average and delivering 1.96× token efficiency.

Significance. If substantiated, the result would demonstrate that a purely visual reasoning substrate can compress and structure rationales more efficiently than text without sacrificing accuracy, opening a new design space for unified multimodal reasoning systems. The work is credited for explicitly framing images as a standalone expressive medium rather than an auxiliary one.

major comments (3)

[§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.
[§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.
[§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.

minor comments (2)

[Abstract, §1] Abstract and §1: the efficiency multiplier (1.96×) is introduced without an explicit definition or reference to the exact baseline token count used in the ratio.
[Figures/Tables] Figure captions and tables: several result visualizations lack axis labels or legend entries that would allow direct comparison of token counts across conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and commit to revisions that improve verifiability and clarity without altering the core claims.

read point-by-point responses

Referee: [§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.

Authors: We agree these reporting details are necessary for rigorous verification. The experiments used standard benchmark splits with repeated runs, but the manuscript omitted the specifics. In revision we will add exact dataset sizes, number of runs, standard deviations or error bars, and statistical significance tests (e.g., paired t-tests) for all headline comparisons. revision: yes
Referee: [§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.

Authors: This point is well taken; direct fidelity verification strengthens the central assumption. While end-to-end accuracy parity was observed, explicit checks were not reported. We will add an ablation and error-analysis subsection that quantifies content agreement (via manual annotation and automated metrics) between visual and text rationales, including discussion of misread cases. revision: yes
Referee: [§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.

Authors: We will expand §3.2 with additional detail. The revision will specify the generation pipeline, the exact optimization objective (token minimization subject to semantic preservation), and the layout constraints (non-overlap rules, font-size bounds, alignment heuristics) used to guarantee MLLM interpretability, accompanied by pseudocode and further examples. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces optical reasoning as a novel concept instantiated via typographic and graphical variants, then reports empirical benchmark outcomes on performance parity and token reductions. No equations, mathematical derivations, fitted parameters, or self-citations appear in the abstract or description. Claims rest on experimental results rather than any chain that reduces by construction to inputs, self-definitions, or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or quantified in the provided text.

pith-pipeline@v0.9.1-grok · 5741 in / 1125 out tokens · 16636 ms · 2026-06-27T16:43:41.879831+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 1 canonical work pages

[1]

Chain-of-thought prompting elicits reasoning in large language models, 2023

JasonWei,XuezhiWang,DaleSchuurmans,Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2023
[2]

doi: 10.1038/s41586-025-09422-z

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645 (8081):633–638, Sept 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025
[3]

Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024. URLhttps://arxiv. org/abs/2401.12863

arXiv 2024
[4]

aha moment

Hengguang Zhou, Xirui Li, Ruochen Wang, Min- hao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1- zero’s "aha moment" in visual reasoning on a 2b non-sft model, 2025. URLhttps://arxiv.org/ abs/2503.05132

Pith/arXiv arXiv 2025
[5]

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixi- ang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning, 2025. URLhttps://arxiv.org/ abs/2503.07365

Pith/arXiv arXiv 2025
[6]

Interleaved-Modal Chain-of-Thought, March

Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-Modal Chain-of-Thought, March
[7]

URL http://arxiv.org/abs/2411. 19488. arXiv:2411.19488 [cs]

arXiv
[8]

Think- ing with Images

Ziwei Zheng, Michael Yang, Jack Hong, Chenx- iao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "Think- ing with Images" via Reinforcement Learning, March 2026. URL http://arxiv.org/abs/ 2505.14362. arXiv:2505.14362 [cs]

Pith/arXiv arXiv 2026
[9]

Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026

Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, YupengHu,WenjieWang,LiqiangNie,andWenjieLi. Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026. URL https:// arxiv.org/abs/2601.09536

Pith/arXiv arXiv 2026
[10]

DeepSeek- OCR: Contexts Optical Compression, October

Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek- OCR: Contexts Optical Compression, October
[11]

URL http://arxiv.org/abs/2510. 18234. arXiv:2510.18234 [cs]

Pith/arXiv arXiv
[12]

CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, and Xiaodong Gu. CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026. URLhttp://arxiv.org/abs/2602. 01785. arXiv:2602.01785 [cs]

Pith/arXiv arXiv 2026
[13]

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, and An Zhang. MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026. URLhttp://arxiv.org/abs/ 2601.21468. arXiv:2601.21468 [cs]

Pith/arXiv arXiv 2026
[14]

AgentOCR: Reimagining Agent History via Optical Self-Compression, February

Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, and Bo An. AgentOCR: Reimagining Agent History via Optical Self-Compression, February
[15]

URL http://arxiv.org/abs/2601. 04786. arXiv:2601.04786 [cs]

arXiv
[16]

Openai gpt-5 sys- tem card, 2026

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, Ak- shay Nathan, Alan Luo, et al. Openai gpt-5 sys- tem card, 2026. URLhttps://arxiv.org/abs/ 2601.03267

Pith/arXiv arXiv 2026
[17]

Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaek- ermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL 10 Optical Reasoni...

Pith/arXiv arXiv 2025
[18]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Ji- ahao Chen, et al. Kimi k2.5: Visual agentic intel- ligence, 2026. URL https://arxiv.org/abs/ 2602.02276

Pith/arXiv arXiv 2026
[19]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhi- fang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixua...

Pith/arXiv arXiv 2025
[20]

Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025. URL https://arxiv.org/abs/2506.05331

arXiv 2025
[21]

Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025. URLhttps: //arxiv.org/abs/2501.07542

Pith/arXiv arXiv 2025
[22]

Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025

Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025. URL https://arxiv.org/abs/2507.16746

arXiv 2025
[23]

Glyph: Scaling context windows via visual-text compression, 2025

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, and Minlie Huang. Glyph: Scaling context windows via visual-text compression, 2025. URL https://arxiv.org/abs/2510.17800

arXiv 2025
[24]

Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026

Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong cheng Tu, Chengyu Wang, Jun Huang, and Dacheng Tao. Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026. URLhttps: //arxiv.org/abs/2601.22069

arXiv 2026
[25]

Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026. URLhttps://arxiv. org/abs/2601.14750

Pith/arXiv arXiv 2026
[26]

Program induction by rationale genera- tion: Learning to solve and explain algebraic word problems.CoRR, abs/1705.04146, 2017

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale genera- tion: Learning to solve and explain algebraic word problems.CoRR, abs/1705.04146, 2017. URL http://arxiv.org/abs/1705.04146

Pith/arXiv arXiv 2017
[27]

Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[28]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof qa benchmark, 2023. URLhttps://arxiv.org/abs/2311.12022

Pith/arXiv arXiv 2023
[29]

Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022. URLhttps://arxiv. org/abs/2209.09513

arXiv 2022
[30]

matplotlib–a portable python plotting package

Paul Barrett, John Hunter, J Todd Miller, J-C Hsu, and Perry Greenfield. matplotlib–a portable python plotting package. InAstronomical data analysis soft- ware and systems XIV, volume 347, page 91, 2005

2005
[31]

verdict":

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression, 2024. URLhttps://arxiv. org/abs/2403.12968. A Case Study To qualitatively illustrate why o...

arXiv 2024

[1] [1]

Chain-of-thought prompting elicits reasoning in large language models, 2023

JasonWei,XuezhiWang,DaleSchuurmans,Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2023

[2] [2]

doi: 10.1038/s41586-025-09422-z

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645 (8081):633–638, Sept 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z

work page doi:10.1038/s41586-025-09422-z 2025

[3] [3]

Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024

Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024. URLhttps://arxiv. org/abs/2401.12863

arXiv 2024

[4] [4]

aha moment

Hengguang Zhou, Xirui Li, Ruochen Wang, Min- hao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1- zero’s "aha moment" in visual reasoning on a 2b non-sft model, 2025. URLhttps://arxiv.org/ abs/2503.05132

Pith/arXiv arXiv 2025

[5] [5]

Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning, 2025

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixi- ang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning, 2025. URLhttps://arxiv.org/ abs/2503.07365

Pith/arXiv arXiv 2025

[6] [6]

Interleaved-Modal Chain-of-Thought, March

Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-Modal Chain-of-Thought, March

[7] [7]

URL http://arxiv.org/abs/2411. 19488. arXiv:2411.19488 [cs]

arXiv

[8] [8]

Think- ing with Images

Ziwei Zheng, Michael Yang, Jack Hong, Chenx- iao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "Think- ing with Images" via Reinforcement Learning, March 2026. URL http://arxiv.org/abs/ 2505.14362. arXiv:2505.14362 [cs]

Pith/arXiv arXiv 2026

[9] [9]

Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026

Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, YupengHu,WenjieWang,LiqiangNie,andWenjieLi. Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026. URL https:// arxiv.org/abs/2601.09536

Pith/arXiv arXiv 2026

[10] [10]

DeepSeek- OCR: Contexts Optical Compression, October

Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek- OCR: Contexts Optical Compression, October

[11] [11]

URL http://arxiv.org/abs/2510. 18234. arXiv:2510.18234 [cs]

Pith/arXiv arXiv

[12] [12]

CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, and Xiaodong Gu. CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026. URLhttp://arxiv.org/abs/2602. 01785. arXiv:2602.01785 [cs]

Pith/arXiv arXiv 2026

[13] [13]

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, and An Zhang. MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026. URLhttp://arxiv.org/abs/ 2601.21468. arXiv:2601.21468 [cs]

Pith/arXiv arXiv 2026

[14] [14]

AgentOCR: Reimagining Agent History via Optical Self-Compression, February

Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, and Bo An. AgentOCR: Reimagining Agent History via Optical Self-Compression, February

[15] [15]

URL http://arxiv.org/abs/2601. 04786. arXiv:2601.04786 [cs]

arXiv

[16] [16]

Openai gpt-5 sys- tem card, 2026

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, Ak- shay Nathan, Alan Luo, et al. Openai gpt-5 sys- tem card, 2026. URLhttps://arxiv.org/abs/ 2601.03267

Pith/arXiv arXiv 2026

[17] [17]

Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaek- ermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL 10 Optical Reasoni...

Pith/arXiv arXiv 2025

[18] [18]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Ji- ahao Chen, et al. Kimi k2.5: Visual agentic intel- ligence, 2026. URL https://arxiv.org/abs/ 2602.02276

Pith/arXiv arXiv 2026

[19] [19]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhi- fang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixua...

Pith/arXiv arXiv 2025

[20] [20]

Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025. URL https://arxiv.org/abs/2506.05331

arXiv 2025

[21] [21]

Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025

Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025. URLhttps: //arxiv.org/abs/2501.07542

Pith/arXiv arXiv 2025

[22] [22]

Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025

Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025. URL https://arxiv.org/abs/2507.16746

arXiv 2025

[23] [23]

Glyph: Scaling context windows via visual-text compression, 2025

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, and Minlie Huang. Glyph: Scaling context windows via visual-text compression, 2025. URL https://arxiv.org/abs/2510.17800

arXiv 2025

[24] [24]

Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026

Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong cheng Tu, Chengyu Wang, Jun Huang, and Dacheng Tao. Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026. URLhttps: //arxiv.org/abs/2601.22069

arXiv 2026

[25] [25]

Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026. URLhttps://arxiv. org/abs/2601.14750

Pith/arXiv arXiv 2026

[26] [26]

Program induction by rationale genera- tion: Learning to solve and explain algebraic word problems.CoRR, abs/1705.04146, 2017

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale genera- tion: Learning to solve and explain algebraic word problems.CoRR, abs/1705.04146, 2017. URL http://arxiv.org/abs/1705.04146

Pith/arXiv arXiv 2017

[27] [27]

Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021

[28] [28]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof qa benchmark, 2023. URLhttps://arxiv.org/abs/2311.12022

Pith/arXiv arXiv 2023

[29] [29]

Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022. URLhttps://arxiv. org/abs/2209.09513

arXiv 2022

[30] [30]

matplotlib–a portable python plotting package

Paul Barrett, John Hunter, J Todd Miller, J-C Hsu, and Perry Greenfield. matplotlib–a portable python plotting package. InAstronomical data analysis soft- ware and systems XIV, volume 347, page 91, 2005

2005

[31] [31]

verdict":

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression, 2024. URLhttps://arxiv. org/abs/2403.12968. A Case Study To qualitatively illustrate why o...

arXiv 2024