Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
Pith reviewed 2026-06-27 16:43 UTC · model grok-4.3
The pith
Images alone can serve as a complete reasoning medium that matches or exceeds text chain-of-thought while cutting token counts substantially.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optical reasoning treats images as a standalone reasoning medium for both language and multimodal tasks. The two variants are typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, this approach matches or exceeds traditional text reasoning while reducing reasoning tokens by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning.
What carries the argument
Optical reasoning, the method of using generated images as the sole medium to encode and present reasoning steps, implemented via typographic and graphical variants.
If this is right
- Optical reasoning matches or exceeds text-based reasoning on mathematical, scientific, and interleaved-modal benchmarks.
- It reduces reasoning tokens by 28.57 percent on language tasks and 16 percent on multimodal tasks.
- It delivers 1.96 times the token efficiency of text reasoning.
- Images supply a unified visual canvas that encodes rationales both effectively and efficiently.
Where Pith is reading between the lines
- Future models could be trained to output visual rationales directly instead of generating text first.
- This approach might lower compute costs for long reasoning sequences by exploiting the density of visual formats.
- Educational tools or scientific visualization systems could adopt visual step-by-step reasoning to improve clarity.
- Spatial or diagrammatic tasks may show even larger gains when reasoning stays entirely in image form.
Load-bearing premise
Multimodal models can accurately read and reason over the generated images without losing information or adding hallucinations compared to the equivalent text chains.
What would settle it
A controlled test in which identical reasoning content is presented once as text and once as the paper's generated image, and the model produces more errors on the image version.
read the original abstract
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes 'optical reasoning,' in which images alone (via typographic layouts or graphical compositions) serve as the complete reasoning medium for both language-only and multimodal tasks, replacing text-based chain-of-thought. It reports that this approach matches or exceeds standard text reasoning on mathematical, scientific, and interleaved-modal benchmarks while cutting reasoning tokens by 28.57% (language) and 16% (multimodal) on average and delivering 1.96× token efficiency.
Significance. If substantiated, the result would demonstrate that a purely visual reasoning substrate can compress and structure rationales more efficiently than text without sacrificing accuracy, opening a new design space for unified multimodal reasoning systems. The work is credited for explicitly framing images as a standalone expressive medium rather than an auxiliary one.
major comments (3)
- [§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.
- [§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.
- [§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.
minor comments (2)
- [Abstract, §1] Abstract and §1: the efficiency multiplier (1.96×) is introduced without an explicit definition or reference to the exact baseline token count used in the ratio.
- [Figures/Tables] Figure captions and tables: several result visualizations lack axis labels or legend entries that would allow direct comparison of token counts across conditions.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and commit to revisions that improve verifiability and clarity without altering the core claims.
read point-by-point responses
-
Referee: [§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.
Authors: We agree these reporting details are necessary for rigorous verification. The experiments used standard benchmark splits with repeated runs, but the manuscript omitted the specifics. In revision we will add exact dataset sizes, number of runs, standard deviations or error bars, and statistical significance tests (e.g., paired t-tests) for all headline comparisons. revision: yes
-
Referee: [§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.
Authors: This point is well taken; direct fidelity verification strengthens the central assumption. While end-to-end accuracy parity was observed, explicit checks were not reported. We will add an ablation and error-analysis subsection that quantifies content agreement (via manual annotation and automated metrics) between visual and text rationales, including discussion of misread cases. revision: yes
-
Referee: [§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.
Authors: We will expand §3.2 with additional detail. The revision will specify the generation pipeline, the exact optimization objective (token minimization subject to semantic preservation), and the layout constraints (non-overlap rules, font-size bounds, alignment heuristics) used to guarantee MLLM interpretability, accompanied by pseudocode and further examples. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper introduces optical reasoning as a novel concept instantiated via typographic and graphical variants, then reports empirical benchmark outcomes on performance parity and token reductions. No equations, mathematical derivations, fitted parameters, or self-citations appear in the abstract or description. Claims rest on experimental results rather than any chain that reduces by construction to inputs, self-definitions, or prior author work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Chain-of-thought prompting elicits reasoning in large language models, 2023
JasonWei,XuezhiWang,DaleSchuurmans,Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903
Pith/arXiv arXiv 2023
-
[2]
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645 (8081):633–638, Sept 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL http://dx. doi.org/10.1038/s41586-025-09422-z
-
[3]
Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024
Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam- cot: Knowledge augmented multimodal chain-of- thoughts reasoning, 2024. URLhttps://arxiv. org/abs/2401.12863
arXiv 2024
-
[4]
Hengguang Zhou, Xirui Li, Ruochen Wang, Min- hao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. R1- zero’s "aha moment" in visual reasoning on a 2b non-sft model, 2025. URLhttps://arxiv.org/ abs/2503.05132
Pith/arXiv arXiv 2025
-
[5]
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixi- ang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning, 2025. URLhttps://arxiv.org/ abs/2503.07365
Pith/arXiv arXiv 2025
-
[6]
Interleaved-Modal Chain-of-Thought, March
Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-Modal Chain-of-Thought, March
-
[7]
URL http://arxiv.org/abs/2411. 19488. arXiv:2411.19488 [cs]
-
[8]
Ziwei Zheng, Michael Yang, Jack Hong, Chenx- iao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "Think- ing with Images" via Reinforcement Learning, March 2026. URL http://arxiv.org/abs/ 2505.14362. arXiv:2505.14362 [cs]
Pith/arXiv arXiv 2026
-
[9]
Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026
Dongjie Cheng, Yongqi Li, Zhixin Ma, Hongru Cai, YupengHu,WenjieWang,LiqiangNie,andWenjieLi. Omni-r1: Towards the unified generative paradigm for multimodal reasoning, 2026. URL https:// arxiv.org/abs/2601.09536
Pith/arXiv arXiv 2026
-
[10]
DeepSeek- OCR: Contexts Optical Compression, October
Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek- OCR: Contexts Optical Compression, October
-
[11]
URL http://arxiv.org/abs/2510. 18234. arXiv:2510.18234 [cs]
-
[12]
CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026
Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, and Xiaodong Gu. CodeOCR: On the Effectiveness of Vi- sion Language Models in Code Understanding, April 2026. URLhttp://arxiv.org/abs/2602. 01785. arXiv:2602.01785 [cs]
Pith/arXiv arXiv 2026
-
[13]
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026
Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, and An Zhang. MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reason- ing, March 2026. URLhttp://arxiv.org/abs/ 2601.21468. arXiv:2601.21468 [cs]
Pith/arXiv arXiv 2026
-
[14]
AgentOCR: Reimagining Agent History via Optical Self-Compression, February
Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, and Bo An. AgentOCR: Reimagining Agent History via Optical Self-Compression, February
-
[15]
URL http://arxiv.org/abs/2601. 04786. arXiv:2601.04786 [cs]
-
[16]
Openai gpt-5 sys- tem card, 2026
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaugh- lin, Aiden Low, AJ Ostrow, Akhila Ananthram, Ak- shay Nathan, Alan Luo, et al. Openai gpt-5 sys- tem card, 2026. URLhttps://arxiv.org/abs/ 2601.03267
Pith/arXiv arXiv 2026
-
[17]
Gheorghe Comanici, Eric Bieber, Mike Schaek- ermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL 10 Optical Reasoni...
Pith/arXiv arXiv 2025
-
[18]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Ji- ahao Chen, et al. Kimi k2.5: Visual agentic intel- ligence, 2026. URL https://arxiv.org/abs/ 2602.02276
Pith/arXiv arXiv 2026
-
[19]
Qwen3-vl technical report, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhi- fang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixua...
Pith/arXiv arXiv 2025
-
[20]
Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025
Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, and Hongsheng Li. Mint-cot: Enablinginterleavedvisualtokensinmath- ematical chain-of-thought reasoning, 2025. URL https://arxiv.org/abs/2506.05331
arXiv 2025
-
[21]
Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025
Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multi- modal visualization-of-thought, 2025. URLhttps: //arxiv.org/abs/2501.07542
Pith/arXiv arXiv 2025
-
[22]
Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025
Ang Li, Charles Wang, Deqing Fu, Kaiyu Yue, Zikui Cai, Wang Bill Zhu, Ollie Liu, Peng Guo, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-cot: A dataset for in- terleaved vision language reasoning, 2025. URL https://arxiv.org/abs/2507.16746
arXiv 2025
-
[23]
Glyph: Scaling context windows via visual-text compression, 2025
Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, and Minlie Huang. Glyph: Scaling context windows via visual-text compression, 2025. URL https://arxiv.org/abs/2510.17800
arXiv 2025
-
[24]
Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026
Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong cheng Tu, Chengyu Wang, Jun Huang, and Dacheng Tao. Vtc-r1: Vision-text compression for efficient long-context reasoning, 2026. URLhttps: //arxiv.org/abs/2601.22069
arXiv 2026
-
[25]
Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026
Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, and Zheng Wei. Render-of-thought: Rendering textual chain-of-thought as images for vi- sual latent reasoning, 2026. URLhttps://arxiv. org/abs/2601.14750
Pith/arXiv arXiv 2026
-
[26]
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale genera- tion: Learning to solve and explain algebraic word problems.CoRR, abs/1705.04146, 2017. URL http://arxiv.org/abs/1705.04146
Pith/arXiv arXiv 2017
-
[27]
Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Trainingverifierstosolvemathwordproblems.arXiv preprint arXiv:2110.14168, 2021
Pith/arXiv arXiv 2021
-
[28]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof qa benchmark, 2023. URLhttps://arxiv.org/abs/2311.12022
Pith/arXiv arXiv 2023
-
[29]
Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Mul- timodal reasoning via thought chains for science question answering, 2022. URLhttps://arxiv. org/abs/2209.09513
arXiv 2022
-
[30]
matplotlib–a portable python plotting package
Paul Barrett, John Hunter, J Todd Miller, J-C Hsu, and Perry Greenfield. matplotlib–a portable python plotting package. InAstronomical data analysis soft- ware and systems XIV, volume 347, page 91, 2005
2005
-
[31]
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression, 2024. URLhttps://arxiv. org/abs/2403.12968. A Case Study To qualitatively illustrate why o...
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.