pith. machine review for the scientific record.

arxiv: 2604.26622 · v1 · submitted 2026-04-29 · 💻 cs.CL


OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

Edith Cheuk-Han Ngai, Jiayi Qu, Jinfeng Xu, Jinze Li, Junhua Ding, Shuo Yang, Xin Yang, Yang Zhang


Pith reviewed 2026-05-07 10:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent memory · long-horizon agents · visual retrieval · LLM agents · context management · trajectory rendering · locate-and-transcribe

The pith

OCR-Memory renders agent histories as images with visual markers so agents can locate and transcribe exact past text without token overload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents in long interactive tasks need to reuse experiences from extended histories, but text prompts quickly exceed token budgets and force costly summarization that loses details. OCR-Memory converts those histories into images annotated with unique visual identifiers instead of keeping raw text. At retrieval time the system locates the relevant image region via the visual anchors and transcribes the original verbatim text from that spot. This approach keeps memory capacity high while holding prompt size low and avoiding generated summaries that could introduce errors. Experiments on long-horizon agent benchmarks show performance gains when context length is strictly limited.

Core claim

OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. It retrieves stored experience via a locate-and-transcribe paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. This enables retention of arbitrarily long histories with minimal prompt overhead at retrieval time.

What carries the argument

The locate-and-transcribe paradigm, which renders trajectories as images with visual anchors, selects relevant regions visually, and transcribes the exact original text from those regions.
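
To make that machinery concrete, here is a minimal sketch of what the rendering step could look like. The paper does not disclose its rendering pipeline (the referee's first minor comment below asks for exactly this), so the font, one-line-per-step layout, and [A0007]-style anchor format are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of trajectory-to-image rendering with visual anchors.
# Everything here (anchor format, layout, default font) is an assumption
# for illustration; the paper does not specify these details.
from PIL import Image, ImageDraw, ImageFont

def render_trajectory(steps, width=1024, line_height=18):
    """Render trajectory steps onto one image, one line per step,
    each prefixed with a unique visual identifier such as [A0007]."""
    img = Image.new("RGB", (width, line_height * len(steps)), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    anchor_index = {}  # anchor id -> (x0, y0, x1, y1) region in the image
    for i, step in enumerate(steps):
        anchor = f"[A{i:04d}]"
        y = i * line_height
        draw.text((4, y), f"{anchor} {step}", fill="black", font=font)
        anchor_index[anchor] = (0, y, width, y + line_height)
    return img, anchor_index
```

At retrieval time, a vision-language model prompted with the rendered image would name the relevant anchor; anchor_index then yields the region whose text is transcribed verbatim rather than regenerated.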

If this is right

  • Agents can retain and reuse experience from histories that would otherwise exceed token limits.
  • Evidence recovery stays verbatim rather than relying on potentially lossy summaries.
  • Prompt overhead remains low even as the number of past steps grows arbitrarily.
  • Consistent performance improvements appear under strict context budgets on standard long-horizon benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same optical rendering and retrieval steps could apply to long-document question answering by turning documents into annotated image sets.
  • Hybrid systems might combine OCR-Memory with existing text-only memories for tasks where some information is better kept in raw text.
  • Performance would likely scale with the accuracy of the underlying vision-language model used for location and transcription.

Load-bearing premise

Rendering trajectories as images with visual identifiers and transcribing text from located regions preserves all original information without loss or errors in the transcription step.

What would settle it

A benchmark run on any long-horizon agent task in which the text transcribed from the located image region differs from the original trajectory text.
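
Stated as a procedure, that settling test is a round-trip audit. The sketch below is hedged: it reuses the hypothetical render_trajectory above and treats the VLM/OCR step as an opaque transcribe function, since the paper reports no such per-region check.

```python
# Round-trip audit for the settling test: render, locate each step by its
# anchor, transcribe the cropped region, and compare against the original
# text verbatim. `render` and `transcribe` are the hypothetical functions
# described above; nothing here is taken from the paper's implementation.
def roundtrip_audit(steps, render, transcribe):
    img, anchor_index = render(steps)
    mismatches = []
    for i, step in enumerate(steps):
        anchor = f"[A{i:04d}]"
        region = img.crop(anchor_index[anchor])  # locate via the anchor's region
        recovered = transcribe(region)           # OCR / VLM transcription
        if recovered.strip() != f"{anchor} {step}":
            mismatches.append(i)  # any mismatch is the evidence described above
    return mismatches  # empty list = transcription was verbatim throughout
```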

Figures

Figures reproduced from arXiv: 2604.26622 by Edith Cheuk-Han Ngai, Jiayi Qu, Jinfeng Xu, Jinze Li, Junhua Ding, Shuo Yang, Xin Yang, Yang Zhang.

Figure 1: Overview of the OCR-Memory. The system enables long-horizon agent memory by storing interaction… (view at source ↗)
Figure 2: Performance comparison under varying con… (view at source ↗)
read the original abstract

Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a locate-and-transcribe paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes OCR-Memory, a framework for long-horizon LLM agent memory that renders historical trajectories as images annotated with unique visual identifiers and retrieves relevant experience via a locate-and-transcribe paradigm (visual anchor selection followed by verbatim OCR transcription). This is claimed to enable retention of arbitrarily long histories with minimal prompt overhead while reducing hallucination relative to text summarization or free-form generation. Experiments on long-horizon agent benchmarks report consistent gains under strict context limits.

Significance. If the optical encoding and retrieval preserve information fidelity, the approach could meaningfully expand effective memory capacity for autonomous agents by exploiting visual density and avoiding token-budget trade-offs, with potential applicability to interactive settings where raw trajectory reuse is otherwise prohibitive.

major comments (2)
  1. [Abstract and §4 (Experiments)] The claim of 'consistent gains' under context limits supplies no details on baselines, controls, error bars, statistical significance, or exclusion criteria, preventing verification that the reported improvements support the central claim rather than reflecting post-hoc selection or uncontrolled variables.
  2. [§3 (Method, locate-and-transcribe)] The load-bearing assumption that image rendering plus VLM-based region selection and OCR recovers structured agent data (observations, actions, JSON, coordinates) without systematic loss or hallucination is unverified; no quantitative transcription error rates or fidelity metrics are reported, despite the abstract's assertion of reduced hallucination.
minor comments (2)
  1. [§3] Clarify the precise image rendering pipeline, font scaling, compression settings, and visual identifier design to allow reproducibility.
  2. [Discussion] Add a limitations section discussing failure modes of the VLM in anchor detection or transcription on non-textual or densely formatted trajectories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in the manuscript. We provide point-by-point responses below and commit to revisions that address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The claim of 'consistent gains' under context limits supplies no details on baselines, controls, error bars, statistical significance, or exclusion criteria, preventing verification that the reported improvements support the central claim rather than reflecting post-hoc selection or uncontrolled variables.

    Authors: We agree that the current manuscript lacks sufficient experimental details to fully substantiate the 'consistent gains' claim. In the revised version, we will substantially expand Section 4 to include complete descriptions of all baselines and controls, error bars computed over multiple independent runs, results of statistical significance tests (such as paired t-tests with p-values), and explicit exclusion criteria. The abstract will be updated to reference these additions. These changes will allow readers to independently verify that the improvements are robust and not due to uncontrolled factors. revision: yes

  2. Referee: [§3 (Method, locate-and-transcribe)] The load-bearing assumption that image rendering plus VLM-based region selection and OCR recovers structured agent data (observations, actions, JSON, coordinates) without systematic loss or hallucination is unverified; no quantitative transcription error rates or fidelity metrics are reported, despite the abstract's assertion of reduced hallucination.

    Authors: We acknowledge that direct quantitative verification of transcription fidelity is missing from the original submission, even though end-to-end benchmark gains provide indirect support. In the revision, we will add a new analysis subsection (or appendix) reporting quantitative metrics, including character error rates and exact-match accuracy for OCR transcription on sampled trajectories, fidelity scores for recovery of structured elements (JSON, coordinates, actions), and a direct comparison of hallucination rates against text-summarization baselines. This will provide explicit evidence for the locate-and-transcribe paradigm's reliability. revision: yes
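
For concreteness, the character error rate the rebuttal commits to reporting is conventionally defined as Levenshtein edit distance normalized by reference length. A minimal sketch of that standard metric follows; it is an illustration, not code from the paper.

```python
# Character error rate (CER): Levenshtein edit distance between the
# original trajectory text and its transcription, normalized by the
# reference length. A standard definition, sketched for illustration.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))                 # distances against empty prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)

# Exact-match accuracy over a sample is then: mean(cer(r, h) == 0.0 for r, h in pairs).
```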

Circularity Check

0 steps flagged

No circularity: empirical framework proposal with no derivations or self-referential reductions

full rationale

The paper introduces OCR-Memory as a new encoding and retrieval method for agent histories by rendering trajectories as annotated images and applying locate-and-transcribe retrieval. No equations, derivations, fitted parameters, or first-principles claims are present. The central claims rest on the design description and experimental validation on long-horizon benchmarks rather than any reduction of results to inputs by construction, self-citations, or ansatzes. This matches the default expectation of non-circularity for descriptive system papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated assumptions that visual rendering faithfully captures trajectory details and that locate-and-transcribe avoids hallucination; no explicit free parameters, axioms, or invented physical entities are described.

pith-pipeline@v0.9.0 · 5496 in / 1168 out tokens · 43295 ms · 2026-05-07T10:41:59.464274+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.

Reference graph

Works this paper leans on

33 extracted references · 27 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. https://arxiv.org/abs/2310.11511 Self-rag: Learning to retrieve, generate, and critique through self-reflection . Preprint, arXiv:2310.11511

  2. [2]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://arxiv.org/abs/2308.14508 Longbench: A bilingual, multitask benchmark for long context understanding . Preprint, arXiv:2308.14508

  3. [3]

    Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. 2023. https://arxiv.org/abs/2310.05029 Walking down the memory maze: Beyond context limit through interactive reading

  4. [4]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. https://arxiv.org/abs/2306.06070 Mind2web: Towards a generalist agent for the web . In NeurIPS

  5. [5]

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024a. https://openreview.net/forum?id=VtmBAGCN7o Metagpt: Meta programming for a multi-agent collaborative framework. In Proceedings of ...

  6. [6]

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2024b. https://arxiv.org/abs/2312.08914 Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  7. [7]

    Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao. 2023. https://arxiv.org/abs/2306.03901 Chatdb: Augmenting llms with databases as their symbolic memory

  8. [8]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. https://arxiv.org/abs/2310.05736 Llmlingua: Compressing prompts for accelerated inference of large language models . In EMNLP

  9. [9]

    Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. 2025. https://arxiv.org/abs/2510.00615 ACON: Optimizing context compression for long-horizon LLM agents. arXiv preprint arXiv:2510.00615

  10. [10]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. https://arxiv.org/abs/2005.11401 Retrieval-augmented generation for knowledge-intensive nlp tasks . Preprint, arXiv:2005.11401

  11. [11]

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.391 Compressing context to enhance inference efficiency of large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342--6353, Singapore. Association for Computational Linguistics

  12. [12]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. https://arxiv.org/abs/2307.03172 Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00449

  13. [13]

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. https://arxiv.org/abs/2310.08560 MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560

  14. [14]

    Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. https://doi.org/10.1145/3586183.3606763 Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1--22

  15. [15]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. https://arxiv.org/abs/2401.18059 Raptor: Recursive abstractive processing for tree-organized retrieval . Preprint, arXiv:2401.18059

  16. [16]

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. https://arxiv.org/abs/2303.11366 Reflexion: Language agents with verbal reinforcement learning . In Advances in Neural Information Processing Systems

  17. [17]

    Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. 2024. https://arxiv.org/abs/2309.02427 Cognitive architectures for language agents

  18. [18]

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. 2024. https://arxiv.org/abs/2407.18901 Appworld: A controllable world of apps and people for benchmarking interactive coding agents . In Proceedings of the 62nd Annual Meeting of the Association for Computation...

  19. [19]

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. https://openreview.net/forum?id=ehfRiF0R3a Voyager: An open-ended embodied agent with large language models . Transactions on Machine Learning Research

  20. [20]

    Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2024. https://arxiv.org/abs/2409.07429 Agent workflow memory . arXiv preprint arXiv:2409.07429

  21. [21]

    Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. https://arxiv.org/abs/2510.18234 DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234

  22. [22]

    Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. 2023. https://arxiv.org/abs/2309.16292 Dilu: A knowledge-driven approach to autonomous driving with large language models . Preprint, arXiv:2309.16292

  23. [23]

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, and 10 others. 2023. https://arxiv.org/abs/2309.07864 The rise and potential of large language model based agents: A survey...

  24. [24]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. https://arxiv.org/abs/2309.17453 Efficient streaming language models with attention sinks . In International Conference on Learning Representations

  25. [25]

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2025. https://arxiv.org/abs/2508.19828 Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828

  26. [26]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://aclanthology.org/D18-1259/ HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

  27. [27]

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. https://arxiv.org/abs/2312.13771 Appagent: Multimodal agents as smartphone users . Preprint, arXiv:2312.13771

  28. [28]

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. 2025. https://arxiv.org/abs/2509.24704 Memgen: Weaving generative latent memory for self-evolving agents . arXiv preprint arXiv:2509.24704

  29. [29]

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jie Liu, and Gao Huang. 2024. https://ojs.aaai.org/index.php/AAAI/article/view/29936 Expel: Llm agents are experiential learners . Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19716--19723

  30. [30]

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. https://doi.org/10.1609/aaai.v38i17.29946 Memorybank: Enhancing large language models with long-term memory . In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19724--19731

  31. [31]

    Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023. https://arxiv.org/abs/2305.17144 Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory . In Advances in Neu...


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...