pith. machine review for the scientific record. sign in

arxiv: 2602.01785 · v2 · submitted 2026-02-02 · 💻 cs.CL · cs.SE

Recognition: 2 theorem links

· Lean Theorem

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:28 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords multimodal llmscode understandingvision language modelsimage renderingtoken compressionclone detectionsyntax highlightingcode completion
0
0 comments X

The pith

Vision language models understand source code rendered as images with up to 8x token reduction compared to text inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether multimodal large language models can process source code by viewing it as rendered images instead of reading linear text sequences. This shift enables compression by lowering image resolution, which cuts token counts without destroying recognizability for the model. Experiments across code completion and clone detection show that models retain strong performance at high compression ratios, sometimes matching or exceeding text baselines when syntax highlighting is included in the image. The approach targets efficiency bottlenecks as codebases grow larger, offering a visual alternative to traditional token-based methods.

Core claim

MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; MLLMs can leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs.

What carries the argument

Rendering source code as images for MLLM input, with adjustable resolution to achieve token compression while keeping the code visually recognizable.

If this is right

  • MLLMs achieve effective code understanding with up to 8x token reduction compared to text inputs.
  • Syntax highlighting in rendered images improves code completion at 4x compression.
  • Clone detection remains resilient to visual compression and can slightly exceed text performance at certain ratios.
  • Image-modality representation offers a pathway to lower computational costs for code inference tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Larger codebases could be handled in a single context window by visually compressing sections that text models would truncate.
  • Training specialized MLLMs on code-image pairs might further close any remaining gaps with text models.
  • IDE plugins could render live code views for lighter-weight analysis without full text tokenization.

Load-bearing premise

Rendering source code as images preserves enough semantic and structural information for MLLMs to perform code tasks at levels comparable to text inputs.

What would settle it

A controlled test showing MLLM accuracy on image-rendered code falling below text baselines on clone detection and completion at the 4x and 8x compression levels reported in the experiments.

Figures

Figures reproduced from arXiv: 2602.01785 by Chaoxiang Xie, Chengcheng Wan, Chenxu Zhang, David Lo, Hongyu Zhang, Longfei Yun, Xiaodong Gu, Yeheng Chen, Yuling Shi, Zhensu Sun.

Figure 1
Figure 1. Figure 1: An Example of Code Representations across Different Modalities. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Multimodal Processing Pipeline for Visualized Code Understanding in MLLMs. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Empirical Study Design and Core Findings. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Examples of Visual Rendering Strategies: Plain, Bold, and Highlight. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance under Varying Remaining Tokens across Different Tasks. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Code Reconstruction Performance across Different Remaining Token Ratios. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Time-to-First-Token (TTFT) Compar￾ison: Text vs. Image Inputs A key question for practical deployment is whether visual code processing introduces prohibitive latency overhead compared to text-based approaches. While commercial API providers typically charge the same rate for visual and text tokens [33, 65], the actual computational cost may differ due to the additional visual encoder and align￾ment stages… view at source ↗
Figure 8
Figure 8. Figure 8: CodeOCR Workflow Our experiments reveal that visual code representation offers a promising paradigm for MLLM-based code understanding, achiev￾ing comparable or improved performance at significant compres￾sion ratios. Building on these findings, we developed CodeOCR, a practical middleware for rendering source code into images with configurable visual enhancements and compression ratios. Workflow. As illust… view at source ↗
read the original abstract

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeOCR, an empirical study exploring the use of Multimodal LLMs (MLLMs) for source code understanding by rendering code as images instead of text tokens. It reports that this approach enables up to 8x token compression while preserving effectiveness on tasks such as code completion and clone detection, with additional gains from visual cues like syntax highlighting and notable resilience in clone detection where some compressed ratios slightly exceed text baselines.

Significance. If the reported compression ratios and task performances hold under standard controls, the work demonstrates a viable shift from linear text tokenization to compressible image representations for code, which could substantially lower inference costs for large-scale code models without requiring architectural changes to the underlying MLLMs. This is particularly relevant given the scaling challenges in code LLMs.

major comments (2)
  1. [§4] §4 (Experimental Setup): The abstract reports up to 8x compression and resilience in clone detection, but without explicit details on the image rendering parameters (resolution, font size, highlighting method), baseline tokenizers, or exact dataset sizes and splits, it is impossible to assess whether the gains are robust to standard controls such as different MLLM backbones or code formatting variations.
  2. [Table 2] Table 2 (Clone Detection Results): The claim that some compression ratios 'slightly outperform' raw text inputs requires error bars, statistical tests, and ablation on whether this holds across multiple random seeds or code domains; otherwise the resilience conclusion rests on potentially noisy point estimates.
minor comments (2)
  1. [Abstract] The abstract uses 'CodeOCR' as the title but does not define the acronym or method name in the body; clarify whether this refers to a specific rendering pipeline or is simply the study name.
  2. [Figure 1] Figure 1 (Example Renderings): Ensure that the visual examples include both highlighted and plain-text versions at the claimed compression ratios so readers can directly inspect information loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The abstract reports up to 8x compression and resilience in clone detection, but without explicit details on the image rendering parameters (resolution, font size, highlighting method), baseline tokenizers, or exact dataset sizes and splits, it is impossible to assess whether the gains are robust to standard controls such as different MLLM backbones or code formatting variations.

    Authors: We agree that additional details are essential for reproducibility and robustness assessment. In the revised manuscript, we will expand Section 4 to explicitly specify the image rendering parameters, including resolution (e.g., 512x512 pixels), font size (12pt), and highlighting method (using Pygments with the 'monokai' style). We will also detail the baseline tokenizers (GPT-2 tokenizer for text inputs) and provide exact dataset sizes and splits (e.g., 5,000 samples from the CodeXGLUE clone detection dataset with an 80/10/10 train/validation/test split). Furthermore, we will include results from additional MLLM backbones such as LLaVA-1.5 and Qwen-VL to demonstrate robustness across models. revision: yes

  2. Referee: [Table 2] Table 2 (Clone Detection Results): The claim that some compression ratios 'slightly outperform' raw text inputs requires error bars, statistical tests, and ablation on whether this holds across multiple random seeds or code domains; otherwise the resilience conclusion rests on potentially noisy point estimates.

    Authors: We acknowledge that the current presentation relies on point estimates and would benefit from statistical validation. In the revision, we will add error bars representing standard deviation over 5 random seeds, conduct paired t-tests to assess significance of the slight outperformance, and include ablations across multiple code domains (Python, Java, C++). We believe these additions will substantiate the resilience claim without altering the core findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely empirical study reporting experimental results on MLLM performance for code tasks under image-based rendering and compression. No derivations, equations, fitted parameters, or self-citations are used to derive claims; performance metrics (e.g., 8x compression, clone detection resilience) are direct experimental measurements. The central claims rest on observed outcomes across tasks rather than any self-referential loop or imported uniqueness theorem. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study; it introduces no new mathematical axioms, free parameters, or postulated entities beyond standard assumptions of multimodal model capability.

pith-pipeline@v0.9.0 · 5588 in / 1007 out tokens · 49849 ms · 2026-05-16T08:28:26.438103+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.

  2. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  3. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training

    cs.CR 2026-03 unverdicted novelty 5.0

    Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 3 Pith papers · 28 internal anchors

  1. [1]

    Megha Agarwal, Asfandyar Qureshi, Nikhil Sardana, Linden Li, Julian Quevedo, and Daya Khudia. 2023. LLM Inference Performance Engineering: Best Practices. Databricks Blog. https://www.databricks.com/blog/llm-inference- performance-engineering-best-practices Accessed: 2025-01-24

  2. [2]

    Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, and Alexey Tumanov. 2025. On Evaluating Performance of LLM Inference Serving Systems. arXiv:2507.09019 [cs.LG] https://arxiv.org/abs/2507.09019

  3. [3]

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Metron: Holistic Performance Evaluation Framework for LLM Inference Systems. arXiv preprint arXiv:2407.07000(2024)

  4. [4]

    Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. 2023. AVATAR: A Parallel Corpus for Java-Python Program Translation. InFindings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 2268–2281

  5. [5]

    Alireza Alaei, Vinh Bui, David Doermann, and Umapada Pal. 2023. Document Image Quality Assessment: A Survey. ACM Comput. Surv.56, 2, Article 29 (Sept. 2023), 36 pages. doi:10.1145/3606692

  6. [6]

    Ajmain Inqiad Alam, Palash Ranjan Roy, Farouq Al-omari, Chanchal Kumar Roy, Banani Roy, and Kevin Schneider

  7. [7]

    InProceedings of the 39th International Conference on Software Maintenance and Evolution (ICSME)

    GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench. InProceedings of the 39th International Conference on Software Maintenance and Evolution (ICSME). IEEE, Bogota, Colombia, 1–12

  8. [8]

    Anonymous. 2025. CodeOCR: Replication Package. https://anonymous.4open.science/r/CodeOCR-FBBA/. Source code, datasets, and reproduction scripts

  9. [9]

    Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. ScreenAI: A Vision-Language Model for UI and Visually-Situated Language Understanding. arXiv:2402.04615 [cs.CV] https://arxiv.org/abs/2402.04615

  10. [10]

    Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis. arXiv:1904.01906 [cs.CV] https://arxiv.org/abs/1904.01906

  11. [11]

    Shuai Bai, Jinze Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou

  12. [12]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966 [cs.CV]

  13. [13]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al . 2025. Qwen3-VL Technical Report. arXiv:2511.21631 [cs.CV] https: //arxiv.org/abs/2511.21631

  14. [14]

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. Nougat: Neural Optical Understanding for Academic Documents. arXiv:2308.13418 [cs.CL]

  15. [15]

    Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, and Timofey Bryksin. 2024. Long Code Arena: A Set of Benchmarks for Long-Context Code Models. arXiv:2406.11612 [cs] doi:10.48550/arXiv.2406.11612

  16. [16]

    Georg Brandl et al. 2006. Pygments: Python Syntax Highlighter. https://pygments.org/. Accessed: 2025-01-01

  17. [17]

    Buse and Westley R

    Raymond P.L. Buse and Westley R. Weimer. 2010. Learning a Metric for Code Readability.IEEE Transactions on Software Engineering36, 4 (2010), 546–558

  18. [18]

    Paterson, Carsten Schulte, Bonita Sharif, and Sascha Siebert

    Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H. Paterson, Carsten Schulte, Bonita Sharif, and Sascha Siebert. 2015. Eye Movements in Code Reading: Relaxing the Linear Order. InProceedings of the 23rd IEEE International Conference on Program Comprehension (ICPC). IEEE, 255–265

  19. [19]

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, and Lichao Sun. 2025. GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding. arXiv:2406.10819 [cs.CV] htt...

  20. [20]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al . 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

  21. [21]

    Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. Swe-exp: Experience-driven software issue resolution. arXiv:2507.23361 [cs.SE]

  22. [22]

    Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, and Han Yang. 2025. AutoNeural: Co-Designing Vision-Language Models for NPU Inference. arXiv:2512.02924 [cs.CL] https://arxiv.org/abs/2512.02924

  23. [23]

    Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, and Mingbao Lin. 2025. Progressive Supernet Training for Efficient Visual Autoregressive Modeling. arXiv:2511.16546 [cs.CV] , Vol. 1, No. 1, Article . Publication date: April 2026. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding 21

  24. [24]

    Zhi Chen and Lingxiao Jiang. 2025. Evaluating software development agents: Patch patterns, code quality, and issue complexity in real-world github scenarios. In2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Montreal, Canada, 657–668

  25. [25]

    Zhi Chen, Wei Ma, and Lingxiao Jiang. 2025. Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution. arXiv:2503.12374 [cs.SE]

  26. [26]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. 2024. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv:2312.14238 [cs.CV] https://arxiv.org/abs/2312.14238

  27. [27]

    Alex Clark and Contributors. 2010. Pillow: The Friendly PIL Fork. https://python-pillow.org/. Accessed: 2025-01-01

  28. [28]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, et al. 2025. Gemini 2.5: Pushing the Frontier with Ad- vanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 [cs.CL] https://arxiv.org/abs/2507.06261

  29. [29]

    DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

  30. [30]

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. InProceedings of the 45th International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, Melbourne, Australia, 31–53

  31. [31]

    Yixiong Fang, Tianran Sun, Yuling Shi, and Xiaodong Gu. 2025. Attentionrag: Attention-guided context pruning in retrieval-augmented generation. arXiv:2503.10720 [cs.CL]

  32. [32]

    Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, and Xiaodong Gu. 2025. LastingBench: Defend Benchmarks Against Knowledge Leakage. arXiv:2506.21614 [cs.LG]

  33. [33]

    Google Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al . 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]

  34. [34]

    GitHub. 2025. GitHub REST API Documentation. https://docs.github.com/en/rest. Accessed: January 2025

  35. [35]

    Google. 2025. Gemini Developer API Pricing. https://ai.google.dev/gemini-api/docs/pricing. Accessed: January 2025

  36. [36]

    Google DeepMind. 2025. Gemini-3-Flash Model Card. https://storage.googleapis.com/deepmind-media/Model- Cards/Gemini-3-Flash-Model-Card.pdf. Official model specification and capabilities document

  37. [37]

    Google DeepMind. 2025. Gemini-3-Pro Model Card. https://storage.googleapis.com/deepmind-media/Model-Cards/ Gemini-3-Pro-Model-Card.pdf. November 2025. Documents Gemini 3 Pro’s training on document understanding and OCR tasks

  38. [38]

    Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR, Vienna, Austria, 16568–16621

  39. [39]

    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation.arXiv preprint arXiv:2203.03850(2022)

  40. [40]

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. LongCoder: A Long-Range Pre-trained Language Model for Code Completion. InProceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202). PMLR, Honolulu, Hawaii, USA, 11969–11984

  41. [41]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.CL]

  42. [42]

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2024. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding. arXiv:2403.12895 [cs.CV]

  43. [43]

    Chao Hu, Wenhao Zeng, Yuling Shi, Beijun Shen, and Xiaodong Gu. 2026. In Line with Context: Repository-Level Code Generation via Context Inlining. arXiv:2601.00376 [cs.SE]

  44. [44]

    Minghao Hu, Junzhe Wang, Weisen Zhao, Qiang Zeng, and Lannan Luo. 2025. FlowMalTrans: Unsupervised Binary Code Translation for Malware Detection Using Flow-Adapter Architecture. arXiv:2508.20212 [cs.CR] https: //arxiv.org/abs/2508.20212

  45. [45]

    Minghao Hu, Qiang Zeng, and Lannan Luo. 2026. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training. arXiv:2603.21058 [cs.CR] https://arxiv.org/abs/2603.21058

  46. [46]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL]

  47. [47]

    Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Reading Text in the Wild with Convolutional Neural Networks. arXiv:1412.1842 [cs.CV] https://arxiv.org/abs/1412.1842

  48. [48]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in , Vol. 1, No. 1, Article . Publication date: April 2026. 22 Shi et al. Natural Language Processing

  49. [49]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515 [cs.SE]

  50. [50]

    Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards Understanding 2D Documents. arXiv:1809.08799 [cs.CL] https://arxiv.org/abs/1809.08799

  51. [51]

    Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty

    Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2024. XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

  52. [52]

    Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-free Document Understanding Transformer. arXiv:2111.15664 [cs.LG] https://arxiv.org/abs/2111.15664

  53. [53]

    Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. arXiv:2210.03347 [cs.CL] https://arxiv.org/abs/2210.03347

  54. [54]

    Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang

  55. [55]

    arXiv:2507.23348 [cs.SE]

    Swe-debate: Competitive multi-agent debate for software issue resolution. arXiv:2507.23348 [cs.SE]

  56. [56]

    Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei

  57. [57]

    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, September 2022

    TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. arXiv:2109.10282 [cs.CL] https://arxiv.org/abs/2109.10282

  58. [58]

    Yanhong Li, Zixuan Lan, and Jiawei Zhou. 2025. Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs. InFindings of the Association for Computational Linguistics: EMNLP 2025. 10564–10578

  59. [59]

    Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, and Shiwen Ni. 2026. Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR.arXiv preprint arXiv:2601.03714(2026)

  60. [60]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. InAdvances in Neural Information Processing Systems, Vol. 36. Curran Associates, Inc., 34892–34916

  61. [61]

    Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2020. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. arXiv:1807.01544 [cs.CV] https://arxiv.org/abs/1807.01544

  62. [62]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al . 2024. StarCoder 2 and The Stack v2: The Next Generation. arXiv:2402.19173 [cs.CL]

  63. [63]

    Maxfield, John Daggett, and Tab Atkins Jr

    Myles C. Maxfield, John Daggett, and Tab Atkins Jr. 2024.CSS Fonts Module Level 4. W3C Working Draft. World Wide Web Consortium (W3C). https://www.w3.org/TR/css-fonts-4/#font-synthesis-style-prop

  64. [64]

    Microsoft Corporation. 2024. Visual Studio Code Documentation: Color Themes. https://code.visualstudio.com/docs/ getstarted/themes. Accessed: 2025-01-01

  65. [65]

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2023. OctoPack: Instruction Tuning Code Large Language Models. arXiv:2308.07124 [cs.CL]

  66. [66]

    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

  67. [67]

    OpenAI. 2025. GPT-5-mini Model Documentation. https://platform.openai.com/docs/models/gpt-5-mini. Accessed via OpenAI API

  68. [68]

    OpenAI. 2025. GPT-5.1 Model Documentation. https://platform.openai.com/docs/models/gpt-5.1. Accessed via OpenAI API

  69. [69]

    OpenAI. 2025. OpenAI API Pricing. https://openai.com/api/pricing/. Accessed: January 2025. Image tokens are priced at standard text token rates for vision-capable models

  70. [70]

    OpenRouter. 2025. OpenRouter: A Unified API for LLMs. https://openrouter.ai/. Accessed: January 2026. Provides unified API access to multiple LLM providers

  71. [71]

    Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, and Xiaoning Du. 2025. The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget.arXiv preprint arXiv:2508.13666(2025). https://arxiv.org/abs/ 2508.13666

  72. [72]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. InAdvances in Neural Information Processing Systems, Vol. 32. 8024–8035

  73. [73]

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. 2025. SWE-QA: Can Language Models Answer Repository-level Code Questions? arXiv:2509.14635 [cs.SE] , Vol. 1, No. 1, Article . Publication date: April 2026. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding 23

  74. [74]

    Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. LongCodeBench: Evaluating Coding LLMs at 1M Context Windows.arXiv preprint arXiv:2505.07897 (2025)

  75. [75]

    Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, and Stefan Feuerriegel. 2021. DocParser: Hierarchical Structure Parsing of Document Renderings. arXiv:1911.01702 [cs.LG] https://arxiv.org/abs/1911.01702

  76. [76]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE]

  77. [77]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open Foundation Models for Code. arXiv:2308.12950 [cs.CL]

  78. [78]

    Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2025. LongCodeZip: Compress Long Context for Code Language Models. arXiv:2510.00446 [cs.SE]

  79. [79]

    Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yixiong Fang, Tianran Sun, and Xiaodong Gu. 2026. Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering.arXiv preprint arXiv:2601.11255 (2026)

  80. [80]

    Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu. 2024. From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging. arXiv:2410.01215 [cs.SE]

Showing first 80 references.