
arxiv: 2606.09585 · v1 · pith:5Z7TBQEAnew · submitted 2026-06-08 · 💻 cs.AI

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Pith reviewed 2026-06-27 16:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords optical reasoning · visual rationales · chain-of-thought · multimodal reasoning · token efficiency · typographic reasoning · graphical reasoning · interleaved-modal reasoning

The pith

Images alone can serve as a complete reasoning medium that matches or exceeds text chain-of-thought while cutting reasoning tokens by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes optical reasoning as a way to replace text-based chains with images that encode all reasoning steps for both language and multimodal problems. It implements this through two forms: typographic layouts that pack rationales compactly into images and graphical compositions that blend text with diagrams. Tests on mathematical, scientific, and interleaved-modal benchmarks show the visual approach performs at least as well as text while using fewer tokens. A sympathetic reader would care because the work suggests visual formats can compress and unify reasoning without sacrificing capability.

Core claim

Optical reasoning treats images as a standalone reasoning medium for both language and multimodal tasks. The two variants are typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, this approach matches or exceeds traditional text reasoning while reducing reasoning tokens by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning.
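The arithmetic behind the headline figures can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names and the token counts (700 and 500) are hypothetical, chosen only so that the reduction reproduces the reported 28.57 percent.

```python
def token_reduction(text_tokens: float, optical_tokens: float) -> float:
    """Fraction of reasoning tokens saved relative to the text baseline."""
    return 1.0 - optical_tokens / text_tokens


def length_ratio(reduction: float) -> float:
    """Text-to-optical token ratio implied by a fractional reduction."""
    return 1.0 / (1.0 - reduction)


# Hypothetical counts, not taken from the paper.
print(f"{token_reduction(700, 500):.2%}")  # 28.57%
print(f"{length_ratio(0.2857):.2f}x")      # 1.40x (language tasks)
print(f"{length_ratio(0.16):.2f}x")        # 1.19x (multimodal tasks)
```

Neither implied length ratio equals 1.96, so the efficiency multiplier presumably rests on a definition that this summary does not state, for example one that also weighs accuracy.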

What carries the argument

Optical reasoning, the method of using generated images as the sole medium to encode and present reasoning steps, implemented via typographic and graphical variants.

If this is right

  • Optical reasoning matches or exceeds text-based reasoning on mathematical, scientific, and interleaved-modal benchmarks.
  • It reduces reasoning tokens by an average of 28.57 percent on language tasks and 16 percent on multimodal tasks.
  • It delivers 1.96 times the token efficiency of text reasoning.
  • Images supply a unified visual canvas that encodes rationales both effectively and efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models could be trained to output visual rationales directly instead of generating text first.
  • This approach might lower compute costs for long reasoning sequences by exploiting the density of visual formats.
  • Educational tools or scientific visualization systems could adopt visual step-by-step reasoning to improve clarity.
  • Spatial or diagrammatic tasks may show even larger gains when reasoning stays entirely in image form.

Load-bearing premise

Multimodal models can accurately read and reason over the generated images without losing information or adding hallucinations compared to the equivalent text chains.
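One way to probe this premise is to ask the model to transcribe a rendered rationale and score the transcription against the source text. The sketch below assumes a character error rate built on Levenshtein distance; the metric choice, the function names, and the example strings are illustrative assumptions, not something the paper reports.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, readback: str) -> float:
    """Edit distance normalised by the length of the reference rationale."""
    if not reference:
        return 0.0 if not readback else 1.0
    return edit_distance(reference, readback) / len(reference)


# Hypothetical example: one symbol misread when the image is transcribed.
print(char_error_rate("x = 2y + 3", "x = 2y + 8"))  # 0.1
```

A low error rate would only show that the characters survive rendering; it would not by itself rule out hallucinated relations between correctly read symbols.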

What would settle it

A controlled test in which identical reasoning content is presented once as text and once as the paper's generated image, and the model produces more errors on the image version.
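Such a paired comparison could be scored with an exact McNemar test on the problems where the two presentations disagree. The sketch below is one possible analysis under that assumption, not a procedure from the paper; the counts in the example are invented.

```python
from math import comb


def mcnemar_exact(text_only_correct: int, image_only_correct: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts.

    text_only_correct: problems solved from the text chain but not the image.
    image_only_correct: problems solved from the image but not the text chain.
    """
    n = text_only_correct + image_only_correct
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(text_only_correct, image_only_correct)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: 9 problems solved only from text, 1 only from the image.
print(f"{mcnemar_exact(9, 1):.4f}")  # 0.0215
```

Problems that both versions solve, or both fail, do not enter the test; only the disagreements carry information about which medium loses content.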

Original abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes 'optical reasoning,' in which images alone (via typographic layouts or graphical compositions) serve as the complete reasoning medium for both language-only and multimodal tasks, replacing text-based chain-of-thought. It reports that this approach matches or exceeds standard text reasoning on mathematical, scientific, and interleaved-modal benchmarks while cutting reasoning tokens by 28.57% (language) and 16% (multimodal) on average and delivering 1.96× token efficiency.

Significance. If substantiated, the result would demonstrate that a purely visual reasoning substrate can compress and structure rationales more efficiently than text without sacrificing accuracy, opening a new design space for unified multimodal reasoning systems. The work is credited for explicitly framing images as a standalone expressive medium rather than an auxiliary one.

major comments (3)
  1. [§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.
  2. [§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.
  3. [§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.
minor comments (2)
  1. [Abstract, §1] Abstract and §1: the efficiency multiplier (1.96×) is introduced without an explicit definition or reference to the exact baseline token count used in the ratio.
  2. [Figures/Tables] Figure captions and tables: several result visualizations lack axis labels or legend entries that would allow direct comparison of token counts across conditions.
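The reporting requested in major comment 1 can be illustrated with a short sketch: per-condition mean and standard deviation across runs, plus a paired t statistic on the per-run differences. The accuracy values are invented for illustration and the function names are hypothetical; nothing here comes from the paper's tables.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across repeated runs."""
    return mean(runs), stdev(runs)


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical accuracies (percent) over five seeds; not from the paper.
optical = [81.0, 83.0, 82.0, 84.0, 80.0]
text = [80.0, 81.0, 82.0, 82.0, 80.0]

print(summarize(optical))                          # (82.0, ~1.58)
print(round(paired_t_statistic(optical, text), 2)) # 2.24
```

Turning the statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` provides.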

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and commit to revisions that improve verifiability and clarity without altering the core claims.

  1. Referee: [§4] §4 (Experiments) and associated tables: the headline performance and efficiency numbers (28.57% token reduction, 1.96× efficiency) are stated without dataset sizes, number of runs, error bars, or statistical significance tests; this directly affects verifiability of the central claim that optical reasoning matches or exceeds text reasoning.

    Authors: We agree these reporting details are necessary for rigorous verification. The experiments used standard benchmark splits with repeated runs, but the manuscript omitted the specifics. In revision we will add exact dataset sizes, number of runs, standard deviations or error bars, and statistical significance tests (e.g., paired t-tests) for all headline comparisons.

    Revision: yes

  2. Referee: [§3, §5] §3 (Method) and §5 (Results): no ablation, fidelity metric, or error analysis is provided on whether the MLLM extracts identical reasoning content from the generated visual rationales as from the corresponding text chains (e.g., layout misreads, symbol misrecognition, or hallucinated relations); this assumption is load-bearing for both the accuracy parity and the token-efficiency ratio.

    Authors: This point is well taken; direct fidelity verification strengthens the central assumption. While end-to-end accuracy parity was observed, explicit checks were not reported. We will add an ablation and error-analysis subsection that quantifies content agreement (via manual annotation and automated metrics) between visual and text rationales, including discussion of misread cases.

    Revision: yes

  3. Referee: [§3.2] §3.2 (Graphical-based optical reasoning): the description of how text and graphical elements are composed into structured visual rationales lacks sufficient detail on the generation procedure, optimization objective, or constraints used to ensure the visual form remains interpretable by the target MLLM.

    Authors: We will expand §3.2 with additional detail. The revision will specify the generation pipeline, the exact optimization objective (token minimization subject to semantic preservation), and the layout constraints (non-overlap rules, font-size bounds, alignment heuristics) used to guarantee MLLM interpretability, accompanied by pseudocode and further examples.

    Revision: yes
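The layout constraints named in the third response (non-overlap rules and font-size bounds) can be sketched as a simple validity check. The `Box` type, the bounds of 10 and 24, and the example coordinates are all hypothetical; the simulated rebuttal does not give the actual rules or thresholds.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box on the canvas (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float
    font_size: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share interior area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def layout_violations(boxes: list[Box], min_font: float = 10.0,
                      max_font: float = 24.0) -> list[str]:
    """List every font-size or non-overlap violation in a candidate layout."""
    problems = []
    for i, box in enumerate(boxes):
        if not min_font <= box.font_size <= max_font:
            problems.append(f"box {i}: font size {box.font_size} out of bounds")
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        if overlaps(a, b):
            problems.append(f"boxes {i} and {j} overlap")
    return problems


# Hypothetical layout: boxes 0 and 1 collide, box 2 uses too small a font.
layout = [Box(0, 0, 100, 20, 12), Box(50, 10, 100, 20, 12), Box(0, 40, 100, 20, 8)]
print(layout_violations(layout))
```

A token-minimizing layout search could reject any candidate for which this list is non-empty, although whether the authors' pipeline works this way is not stated.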

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper introduces optical reasoning as a novel concept instantiated via typographic and graphical variants, then reports empirical benchmark outcomes on performance parity and token reductions. No equations, mathematical derivations, fitted parameters, or self-citations appear in the abstract or description. Claims rest on experimental results rather than any chain that reduces by construction to inputs, self-definitions, or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or quantified in the provided text.

pith-pipeline@v0.9.1-grok · 5741 in / 1125 out tokens · 16636 ms · 2026-06-27T16:43:41.879831+00:00 · methodology

