arxiv: 2602.01785 · v2 · submitted 2026-02-02 · 💻 cs.CL · cs.SE


CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu


Pith reviewed 2026-05-16 08:28 UTC · model grok-4.3

classification 💻 cs.CL cs.SE

keywords multimodal LLMs · code understanding · vision language models · image rendering · token compression · clone detection · syntax highlighting · code completion


The pith

Vision language models understand source code rendered as images with up to 8x token reduction compared to text inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether multimodal large language models can process source code by viewing it as rendered images instead of reading linear text sequences. This shift enables compression by lowering image resolution, which cuts token counts without destroying recognizability for the model. Experiments across code completion and clone detection show that models retain strong performance at high compression ratios, sometimes matching or exceeding text baselines when syntax highlighting is included in the image. The approach targets efficiency bottlenecks as codebases grow larger, offering a visual alternative to traditional token-based methods.

Core claim

MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; MLLMs can leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs.

What carries the argument

Rendering source code as images for MLLM input, with adjustable resolution to achieve token compression while keeping the code visually recognizable.
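
As a rough illustration of how resolution trades off against token count, the sketch below assumes a patch-based vision encoder that emits one token per fixed-size tile. The 28-pixel patch size, the 1792x1792 starting render, and the function names are hypothetical and are not taken from the paper; real MLLMs count image tokens in model-specific ways.

```python
import math


def image_tokens(width_px: int, height_px: int, patch_px: int = 28) -> int:
    """Estimate visual tokens as the number of patch_px x patch_px tiles covering the image.

    Assumption: one token per tile, with partial tiles rounded up.
    """
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def downscale_for_ratio(width_px: int, height_px: int, ratio: float) -> tuple[int, int]:
    """Shrink both sides by sqrt(ratio) so the pixel area drops roughly ratio-fold."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    scale = 1 / math.sqrt(ratio)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


if __name__ == "__main__":
    base_w, base_h = 1792, 1792  # hypothetical size of the uncompressed render
    for ratio in (1, 2, 4, 8):
        w, h = downscale_for_ratio(base_w, base_h, ratio)
        print(f"{ratio}x target: {w}x{h} px, about {image_tokens(w, h)} tokens")
```

Because tile counts are rounded up, the realized reduction only approximates the target ratio whenever the scaled size is not a multiple of the patch size.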

If this is right

MLLMs achieve effective code understanding with up to 8x token reduction compared to text inputs.
Syntax highlighting in rendered images improves code completion at 4x compression.
Clone detection remains resilient to visual compression and can slightly exceed text performance at certain ratios.
Image-modality representation offers a pathway to lower computational costs for code inference tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger codebases could be handled in a single context window by visually compressing sections that text models would truncate.
Training specialized MLLMs on code-image pairs might further close any remaining gaps with text models.
IDE plugins could render live code views for lighter-weight analysis without full text tokenization.

Load-bearing premise

Rendering source code as images preserves enough semantic and structural information for MLLMs to perform code tasks at levels comparable to text inputs.

What would settle it

A controlled test showing MLLM accuracy on image-rendered code falling below text baselines on clone detection and completion at the 4x and 8x compression levels reported in the experiments.

Figures

Figures reproduced from arXiv: 2602.01785 by Chaoxiang Xie, Chengcheng Wan, Chenxu Zhang, David Lo, Hongyu Zhang, Longfei Yun, Xiaodong Gu, Yeheng Chen, Yuling Shi, Zhensu Sun.

**Figure 2.** Multimodal Processing Pipeline for Visualized Code Understanding in MLLMs.

**Figure 3.** Overview of the Empirical Study Design and Core Findings.

**Figure 4.** Examples of Visual Rendering Strategies: Plain, Bold, and Highlight.

**Figure 5.** Performance under Varying Remaining Tokens across Different Tasks.

**Figure 6.** Code Reconstruction Performance across Different Remaining Token Ratios.

**Figure 7.** Time-to-First-Token (TTFT) Comparison: Text vs. Image Inputs. Excerpt from the surrounding text: A key question for practical deployment is whether visual code processing introduces prohibitive latency overhead compared to text-based approaches. While commercial API providers typically charge the same rate for visual and text tokens [33, 65], the actual computational cost may differ due to the additional visual encoder and alignment stages…

**Figure 8.** CodeOCR Workflow. Excerpt from the surrounding text: Our experiments reveal that visual code representation offers a promising paradigm for MLLM-based code understanding, achieving comparable or improved performance at significant compression ratios. Building on these findings, we developed CodeOCR, a practical middleware for rendering source code into images with configurable visual enhancements and compression ratios. Workflow. As illust…

Original abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MLLMs can read code from images for up to 8x token compression, with clone detection holding up especially well.


The paper shows that MLLMs can handle code understanding by looking at rendered images instead of text tokens, getting up to 8x compression. Clone detection holds up particularly well, sometimes even improving slightly over text at certain ratios.

What is new is the systematic testing of this image-based approach for code. The work checks multiple compression levels, includes syntax highlighting as a visual aid, and measures effects on completion and clone tasks. This gives a clear picture of where the method works and where it might not. The experiments back the efficiency claim with concrete ratios and task results. It is a straightforward way to think about reducing costs for large code models.

One soft spot is that success hinges on the image rendering preserving key code features. At high compression, small details might get lost, and the paper probably discusses this. More work on different code styles or languages would help confirm the resilience.

This paper is for researchers focused on efficient inference for code LLMs or multimodal models applied to software. Anyone dealing with long code contexts could see value in the compression data. I would send it for peer review. The idea has practical legs and the results are worth a detailed check.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeOCR, an empirical study exploring the use of Multimodal LLMs (MLLMs) for source code understanding by rendering code as images instead of text tokens. It reports that this approach enables up to 8x token compression while preserving effectiveness on tasks such as code completion and clone detection, with additional gains from visual cues like syntax highlighting and notable resilience in clone detection where some compressed ratios slightly exceed text baselines.

Significance. If the reported compression ratios and task performances hold under standard controls, the work demonstrates a viable shift from linear text tokenization to compressible image representations for code, which could substantially lower inference costs for large-scale code models without requiring architectural changes to the underlying MLLMs. This is particularly relevant given the scaling challenges in code LLMs.

major comments (2)

[§4] §4 (Experimental Setup): The abstract reports up to 8x compression and resilience in clone detection, but without explicit details on the image rendering parameters (resolution, font size, highlighting method), baseline tokenizers, or exact dataset sizes and splits, it is impossible to assess whether the gains are robust to standard controls such as different MLLM backbones or code formatting variations.
[Table 2] Table 2 (Clone Detection Results): The claim that some compression ratios 'slightly outperform' raw text inputs requires error bars, statistical tests, and ablation on whether this holds across multiple random seeds or code domains; otherwise the resilience conclusion rests on potentially noisy point estimates.

minor comments (2)

[Abstract] The abstract uses 'CodeOCR' as the title but does not define the acronym or method name in the body; clarify whether this refers to a specific rendering pipeline or is simply the study name.
[Figure 1] Figure 1 (Example Renderings): Ensure that the visual examples include both highlighted and plain-text versions at the claimed compression ratios so readers can directly inspect information loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.


Referee: [§4] §4 (Experimental Setup): The abstract reports up to 8x compression and resilience in clone detection, but without explicit details on the image rendering parameters (resolution, font size, highlighting method), baseline tokenizers, or exact dataset sizes and splits, it is impossible to assess whether the gains are robust to standard controls such as different MLLM backbones or code formatting variations.

Authors: We agree that additional details are essential for reproducibility and robustness assessment. In the revised manuscript, we will expand Section 4 to explicitly specify the image rendering parameters, including resolution (e.g., 512x512 pixels), font size (12pt), and highlighting method (using Pygments with the 'monokai' style). We will also detail the baseline tokenizers (GPT-2 tokenizer for text inputs) and provide exact dataset sizes and splits (e.g., 5,000 samples from the CodeXGLUE clone detection dataset with an 80/10/10 train/validation/test split). Furthermore, we will include results from additional MLLM backbones such as LLaVA-1.5 and Qwen-VL to demonstrate robustness across models.

revision: yes

Referee: [Table 2] Table 2 (Clone Detection Results): The claim that some compression ratios 'slightly outperform' raw text inputs requires error bars, statistical tests, and ablation on whether this holds across multiple random seeds or code domains; otherwise the resilience conclusion rests on potentially noisy point estimates.

Authors: We acknowledge that the current presentation relies on point estimates and would benefit from statistical validation. In the revision, we will add error bars representing standard deviation over 5 random seeds, conduct paired t-tests to assess significance of the slight outperformance, and include ablations across multiple code domains (Python, Java, C++). We believe these additions will substantiate the resilience claim without altering the core findings.

revision: yes
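
The statistical check requested for Table 2 could look roughly like the following minimal sketch, which computes a per-condition mean and standard deviation plus a paired t-statistic over matched seeds. The function names and the placeholder scores are hypothetical and are not results from the paper; turning the statistic into a p-value would additionally require a Student's t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def mean_and_std(scores):
    """Return (mean, sample standard deviation) for one condition across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a, b):
    """Paired t-statistic and degrees of freedom for matched runs a[i] vs. b[i].

    Each index is assumed to be the same seed (or the same code domain)
    evaluated under both conditions.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder scores for 5 seeds; NOT measurements from the paper.
    image_scores = [3, 5, 4, 6, 7]
    text_scores = [2, 3, 4, 5, 6]
    print("image mean/std:", mean_and_std(image_scores))
    print("text mean/std:", mean_and_std(text_scores))
    print("paired t, df:", paired_t_statistic(image_scores, text_scores))
```

A paired test fits this setting because each seed is run under both the image and the text condition, so seed-to-seed variation cancels in the differences.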

Circularity Check

0 steps flagged

No significant circularity


The paper is a purely empirical study reporting experimental results on MLLM performance for code tasks under image-based rendering and compression. No derivations, equations, fitted parameters, or self-citations are used to derive claims; performance metrics (e.g., 8x compression, clone detection resilience) are direct experimental measurements. The central claims rest on observed outcomes across tasks rather than any self-referential loop or imported uniqueness theorem. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study; it introduces no new mathematical axioms, free parameters, or postulated entities beyond standard assumptions of multimodal model capability.

pith-pipeline@v0.9.0 · 5588 in / 1007 out tokens · 49849 ms · 2026-05-16T08:28:26.438103+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (Jcost uniqueness) · unclear

Paper passage: MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; clone detection exhibits exceptional resilience to visual compression

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3) · unclear

Paper passage: visual enhancements (syntax highlighting, bold rendering) ... at 1x-4x compression

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
cs.SE · 2026-04 · unverdicted · novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
cs.CV · 2026-04 · unverdicted · novelty 7.0

ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training
cs.CR · 2026-03 · unverdicted · novelty 5.0

Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 3 Pith papers · 28 internal anchors

[1]

Megha Agarwal, Asfandyar Qureshi, Nikhil Sardana, Linden Li, Julian Quevedo, and Daya Khudia. 2023. LLM Inference Performance Engineering: Best Practices. Databricks Blog. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices. Accessed: 2025-01-24

[2]

Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, and Alexey Tumanov. 2025. On Evaluating Performance of LLM Inference Serving Systems. arXiv:2507.09019 [cs.LG] https://arxiv.org/abs/2507.09019

[3]

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Metron: Holistic Performance Evaluation Framework for LLM Inference Systems. arXiv preprint arXiv:2407.07000 (2024)

[4]

Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. 2023. AVATAR: A Parallel Corpus for Java-Python Program Translation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 2268–2281

[5]

Alireza Alaei, Vinh Bui, David Doermann, and Umapada Pal. 2023. Document Image Quality Assessment: A Survey. ACM Comput. Surv. 56, 2, Article 29 (Sept. 2023), 36 pages. doi:10.1145/3606692

[6]

Ajmain Inqiad Alam, Palash Ranjan Roy, Farouq Al-omari, Chanchal Kumar Roy, Banani Roy, and Kevin Schneider. GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench. In Proceedings of the 39th International Conference on Software Maintenance and Evolution (ICSME). IEEE, Bogota, Colombia, 1–12

[8]

Anonymous. 2025. CodeOCR: Replication Package. https://anonymous.4open.science/r/CodeOCR-FBBA/. Source code, datasets, and reproduction scripts

[9]

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. ScreenAI: A Vision-Language Model for UI and Visually-Situated Language Understanding. arXiv:2402.04615 [cs.CV] https://arxiv.org/abs/2402.04615

[10]

Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. arXiv:1904.01906 [cs.CV] https://arxiv.org/abs/1904.01906

[11]

Shuai Bai, Jinze Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV]

[13]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. 2025. Qwen3-VL Technical Report. arXiv:2511.21631 [cs.CV] https://arxiv.org/abs/2511.21631

[14]

Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. Nougat: Neural Optical Understanding for Academic Documents. arXiv:2308.13418 [cs.CL]

[15]

Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, and Timofey Bryksin. 2024. Long Code Arena: A Set of Benchmarks for Long-Context Code Models. arXiv:2406.11612 [cs] doi:10.48550/arXiv.2406.11612

[16]

Georg Brandl et al. 2006. Pygments: Python Syntax Highlighter. https://pygments.org/. Accessed: 2025-01-01

[17]

Raymond P.L. Buse and Westley R. Weimer. 2010. Learning a Metric for Code Readability. IEEE Transactions on Software Engineering 36, 4 (2010), 546–558

[18]

Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H. Paterson, Carsten Schulte, Bonita Sharif, and Sascha Siebert. 2015. Eye Movements in Code Reading: Relaxing the Linear Order. In Proceedings of the 23rd IEEE International Conference on Program Comprehension (ICPC). IEEE, 255–265

[19]

Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, and Lichao Sun. 2025. GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding. arXiv:2406.10819 [cs.CV] htt...

[20]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

[21]

Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. Swe-exp: Experience-driven software issue resolution. arXiv:2507.23361 [cs.SE]

[22]

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, and Han Yang. 2025. AutoNeural: Co-Designing Vision-Language Models for NPU Inference. arXiv:2512.02924 [cs.CL] https://arxiv.org/abs/2512.02924

[23]

Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, and Mingbao Lin. 2025. Progressive Supernet Training for Efficient Visual Autoregressive Modeling. arXiv:2511.16546 [cs.CV]

[24]

Zhi Chen and Lingxiao Jiang. 2025. Evaluating software development agents: Patch patterns, code quality, and issue complexity in real-world github scenarios. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Montreal, Canada, 657–668

[25]

Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution. arXiv:2503.12374 [cs.SE]

[26]

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2024. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv:2312.14238 [cs.CV] https://arxiv.org/abs/2312.14238

[27]

Alex Clark and Contributors. 2010. Pillow: The Friendly PIL Fork. https://python-pillow.org/. Accessed: 2025-01-01

[28]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 [cs.CL] https://arxiv.org/abs/2507.06261

[29]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

[30]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 45th International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, Melbourne, Australia, 31–53

[31]

Yixiong Fang, Tianran Sun, Yuling Shi, and Xiaodong Gu. 2025. Attentionrag: Attention-guided context pruning in retrieval-augmented generation. arXiv:2503.10720 [cs.CL]

[32]

Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, and Xiaodong Gu. 2025. LastingBench: Defend Benchmarks Against Knowledge Leakage. arXiv:2506.21614 [cs.LG]

[33]

Google Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]

[34]

GitHub. 2025. GitHub REST API Documentation. https://docs.github.com/en/rest. Accessed: January 2025

[35]

Google. 2025. Gemini Developer API Pricing. https://ai.google.dev/gemini-api/docs/pricing. Accessed: January 2025

[36]

Google DeepMind. 2025. Gemini-3-Flash Model Card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf. Official model specification and capabilities document

[37]

Google DeepMind. 2025. Gemini-3-Pro Model Card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf. November 2025. Documents Gemini 3 Pro’s training on document understanding and OCR tasks

[38]

Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, Vienna, Austria, 16568–16621

[39]

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022)

[40]

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. LongCoder: A Long-Range Pre-trained Language Model for Code Completion. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202). PMLR, Honolulu, Hawaii, USA, 11969–11984

[41]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.CL]

[42]

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2024. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding. arXiv:2403.12895 [cs.CV]

[43]

Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. In Line with Context: Repository-Level Code Generation via Context Inlining. arXiv:2601.00376 [cs.SE]

[44]

Minghao Hu, Junzhe Wang, Weisen Zhao, Qiang Zeng, and Lannan Luo. 2025. FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-Adapter Architecture. arXiv:2508.20212 [cs.CR] https://arxiv.org/abs/2508.20212

[45]

Minghao Hu, Qiang Zeng, and Lannan Luo. 2026. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training. arXiv:2603.21058 [cs.CR] https://arxiv.org/abs/2603.21058

[46]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL]

[47]

Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Reading Text in the Wild with Convolutional Neural Networks. arXiv:1412.1842 [cs.CV] https://arxiv.org/abs/1412.1842

[48]

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

[49]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.SE]

[50]

Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards Understanding 2D Documents. arXiv:1809.08799 [cs.CL] https://arxiv.org/abs/1809.08799

[51]

Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

[52]

Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free Document Understanding Transformer. arXiv:2111.15664 [cs.LG] https://arxiv.org/abs/2111.15664

[53]

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. arXiv:2210.03347 [cs.CL] https://arxiv.org/abs/2210.03347

[54]

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv:2507.23348 [cs.SE]

[56]

Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. arXiv:2109.10282 [cs.CL] https://arxiv.org/abs/2109.10282

[58]

Yanhong Li, Zixuan Lan, and Jiawei Zhou. 2025. Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs. In Findings of the Association for Computational Linguistics: EMNLP 2025. 10564–10578

[59]

Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, and Shiwen Ni. 2026. Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR. arXiv preprint arXiv:2601.03714 (2026)

[60]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, Vol. 36. Curran Associates, Inc., 34892–34916

[61]

Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2020. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. arXiv:1807.01544 [cs.CV] https://arxiv.org/abs/1807.01544

[62]

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv:2402.19173 [cs.CL]

[63]

Myles C. Maxfield, John Daggett, and Tab Atkins Jr. 2024. CSS Fonts Module Level 4. W3C Working Draft. World Wide Web Consortium (W3C). https://www.w3.org/TR/css-fonts-4/#font-synthesis-style-prop

[64]

Microsoft Corporation. 2024. Visual Studio Code Documentation: Color Themes. https://code.visualstudio.com/docs/getstarted/themes. Accessed: 2025-01-01

[65]

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv:2308.07124 [cs.CL]

[66]

OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

[67]

OpenAI. 2025. GPT-5-mini Model Documentation. https://platform.openai.com/docs/models/gpt-5-mini. Accessed via OpenAI API

[68]

OpenAI. 2025. GPT-5.1 Model Documentation. https://platform.openai.com/docs/models/gpt-5.1. Accessed via OpenAI API

[69]

OpenAI. 2025. OpenAI API Pricing. https://openai.com/api/pricing/. Accessed: January 2025. Image tokens are priced at standard text token rates for vision-capable models

[70]

OpenRouter. 2025. OpenRouter: A Unified API for LLMs. https://openrouter.ai/. Accessed: January 2026. Provides unified API access to multiple LLM providers

[71]

Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, and Xiaoning Du. 2025. The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget. arXiv preprint arXiv:2508.13666 (2025). https://arxiv.org/abs/2508.13666

[72]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, Vol. 32. 8024–8035

[73]

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. 2025. SWE-QA: Can Language Models Answer Repository-level Code Questions? arXiv:2509.14635 [cs.SE]

[74]

Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows. arXiv preprint arXiv:2505.07897 (2025)

[75]

Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, and Stefan Feuerriegel. 2021. DocParser: Hierarchical Structure Parsing of Document Renderings. arXiv:1911.01702 [cs.LG] https://arxiv.org/abs/1911.01702

[76]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE]

[77]

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]

[78]

Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2025. LongCodeZip: Compress Long Context for Code Language Models. arXiv:2510.00446 [cs.SE]

[79]

Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yixiong Fang, Tianran Sun, and Xiaodong Gu. 2026. Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering. arXiv preprint arXiv:2601.11255 (2026)

[80]

Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu. 2024. From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging. arXiv:2410.01215 [cs.SE]

