PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Omar Khattab; Qizheng Zhang; Samuel Madden; Zhuohan Gu

arXiv: 2605.19932 · v1 · pith:AVB4N3AWnew · submitted 2026-05-19 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-20 05:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG

keywords: context map, orientation cache, long-context LLM agents, prompt learning, context learning, information aggregation, cache policy, LLM agents

The pith

LLM agents improve accuracy and cut costs on recurring long-context tasks by caching reusable orientation knowledge in a small context map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing LLM agent systems either keep full interaction histories, grant raw access to large contexts, or learn task strategies, but none preserve compact, reusable knowledge about what a recurring external context contains, how it is organized, and which entities have proven useful before. PEEK maintains exactly this orientation knowledge inside a fixed-size context map that sits in the agent's prompt and is updated across invocations by three modules: one that distills signals from each run, one that turns those signals into structured map edits, and one that evicts low-priority entries to stay within a token budget. Experiments show the map yields higher success rates on reasoning, aggregation, and context-learning benchmarks while requiring far fewer steps and lower token spend than the leading prompt-learning baseline. A sympathetic reader would care because the result suggests agents can treat large document collections or codebases as stable resources rather than re-exploring them from scratch each time.

Core claim

A context map is a small, constant-sized artifact placed in the agent's prompt that records orientation knowledge about a recurring external context; it is kept up to date by a programmable policy whose Distiller extracts transferable facts from inference-time signals, whose Cartographer converts those facts into structured map edits, and whose priority-based Evictor enforces a fixed token limit. When this map is present, agents solve long-context reasoning and information-aggregation problems more accurately, require fewer iterations, and incur lower cost than agents that rely on raw context access or prior prompt-learning methods.

What carries the argument

The context map, a fixed-budget cache of orientation knowledge updated by a three-module policy (Distiller for extraction, Cartographer for structured edits, Evictor for token control).
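
A minimal sketch of how a fixed-budget map with priority-based eviction could behave. It assumes a whitespace word count stands in for a real tokenizer and that each entry already carries a numeric priority. The names `MapEntry`, `ContextMap`, `upsert`, `evict`, and `render` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MapEntry:
    item_id: str      # stable ID so later edits can target the same entry
    text: str         # the orientation fact shown to the agent
    priority: float   # higher means more worth keeping (assumed to be given)

    def tokens(self) -> int:
        # Assumption: whitespace word count approximates token count.
        return len(self.text.split())


@dataclass
class ContextMap:
    budget: int                                            # hard token limit
    entries: dict[str, MapEntry] = field(default_factory=dict)

    def size(self) -> int:
        return sum(e.tokens() for e in self.entries.values())

    def upsert(self, entry: MapEntry) -> list[str]:
        """Insert or overwrite an entry, then evict until within budget."""
        self.entries[entry.item_id] = entry
        return self.evict()

    def evict(self) -> list[str]:
        """Drop lowest-priority entries first and return the evicted IDs."""
        evicted = []
        while self.size() > self.budget and self.entries:
            victim = min(self.entries.values(), key=lambda e: e.priority)
            del self.entries[victim.item_id]
            evicted.append(victim.item_id)
        return evicted

    def render(self) -> str:
        """Text placed in the agent's prompt, highest priority first."""
        ordered = sorted(self.entries.values(), key=lambda e: -e.priority)
        return "\n".join(f"[{e.item_id}] {e.text}" for e in ordered)
```

In this sketch a newly added entry can be evicted immediately if its priority is the lowest in the map, which is one concrete form of the information-loss risk raised elsewhere on this page.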

If this is right

On long-context reasoning and information-aggregation tasks, the map improves scores by 6.3-34.0% over strong baselines while using 93-145 fewer iterations and costing 1.7-5.8x less than ACE, the leading prompt-learning framework.
On context-learning tasks, the map lifts solving rate by 6.0-14.0% and rubric accuracy by 7.8-12.1% at 1.4x lower cost than ACE.
The same map-based policy produces gains across different language models and agent architectures, including a production coding agent.
Because the map is kept to a fixed token budget, agents can reuse the same external context across many independent sessions without growing prompt size.
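
A short worked conversion of the cost figures, assuming "r times lower cost" means that the ACE cost divided by the PEEK cost equals r (this page does not define the ratio explicitly). Under that reading, the fractional saving is:

```latex
\text{saving} \;=\; 1 - \frac{C_{\text{PEEK}}}{C_{\text{ACE}}} \;=\; 1 - \frac{1}{r},
\qquad
r = 1.7 \Rightarrow 41.2\%,\quad
r = 5.8 \Rightarrow 82.8\%,\quad
r = 1.4 \Rightarrow 28.6\%.
```

So the reported 1.7-5.8x range would correspond to roughly 41-83% lower spend, and the 1.4x figure to about 29%.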

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on contexts that slowly evolve, such as live code repositories with new commits, to see whether the eviction policy can still surface the most relevant orientation facts.
Sharing a single context map across multiple agents working on the same corpus might reduce redundant exploration even further than single-agent results show.
If the map is made inspectable by humans, it could serve as an audit trail for which parts of a large context an agent has historically found useful.

Load-bearing premise

Useful orientation knowledge about what the context contains and which parts have been helpful can be extracted and kept inside a small constant-sized artifact without dropping task-critical details over many uses.

What would settle it

Run the same long-context reasoning and context-learning benchmarks with the context map disabled while keeping every other component identical; if accuracy, iteration count, and cost show no meaningful degradation, the central claim is falsified.
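
One way such an ablation could be scripted, as a rough sketch. The `run_agent` callable, its `use_map` flag, and the `RunResult` fields are hypothetical stand-ins for whatever harness the benchmarks actually use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class RunResult:
    correct: bool
    iterations: int
    cost_usd: float


def ablate(
    run_agent: Callable[[str, bool], RunResult],
    tasks: Iterable[str],
) -> dict[str, dict[str, float]]:
    """Run every task twice, toggling only the map, and average the metrics."""
    task_list = list(tasks)
    summary = {}
    for label, use_map in (("with_map", True), ("without_map", False)):
        results = [run_agent(task, use_map) for task in task_list]
        summary[label] = {
            "accuracy": mean(1.0 if r.correct else 0.0 for r in results),
            "iterations": mean(r.iterations for r in results),
            "cost_usd": mean(r.cost_usd for r in results),
        }
    return summary
```

Because both arms see the same task list and differ only in the flag, any gap between the two summaries can be attributed to the map rather than to workload differences, provided `run_agent` keeps every other component fixed.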

Figures

Figures reproduced from arXiv: 2605.19932 by Omar Khattab, Qizheng Zhang, Samuel Madden, Zhuohan Gu.

**Figure 1.** Performance Snapshot (GPT-5-mini as the Base LM). PEEK (our system) consistently achieves the highest scores across long-context tasks compared with strong baselines. Large language model (LLM) agents such as Claude Code [4], Codex [26], RLM [49], OpenClaw [37], and Hermes Agent [25] increasingly operate over large and recurring external contexts: document …

**Figure 2.** Related context-management methods along two axes. The horizontal axis captures whether the method is about managing Agent / Task State, i.e., the agent's execution or task behavior, or about managing External Context State, i.e., the recurring external context itself. On the vertical axis, Active methods deliberately maintain an artifact across interactions while Passive methods carry, retrieve, or …

**Figure 3.** The PEEK System. Inspired by caching in computer systems and the notion of peeking, PEEK caches orientation knowledge in a context map and updates it through a modular process consisting of a Distiller, a Cartographer, and an Evictor. … is a cache management policy (dashed box, green stars) that, after each query completes, inspects the execution trajectory and updates the map for use in the next query. The …

**Figure 4.** Example Context Map Generated by PEEK (Partially Shown). The map stores contextual knowledge in structured sections with stable item IDs, enabling consistent cache updates. When an agent repeatedly interacts with a long external context, it often spends its first several iterations building a working understanding of that context: what it contains, how it is organized, which entities and concepts matter, a…

**Figure 5.** Score vs. Total Iterations (Top): The upper-left region (higher score, fewer iterations) is better. Score vs. Total Cost (Bottom): Total cost includes both execution cost and method-specific overhead (ACE adaptation or PEEK maintenance), and the upper-left region (higher score, lower cost) is better. Across both views, PEEK consistently lies on the Pareto frontier across all four benchmarks. … some splits bu…

**Figure 6.** CL-bench Leaderboard Snapshot (May 2026).

Original abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PEEK gives a practical framing for caching orientation knowledge in repeated long-context agent work, with measured gains over ACE, but the fixed-budget map needs checking for information loss.

The main thing here is that PEEK treats reusable orientation knowledge (what the context holds, how it is organized, and which parts have mattered before) as something worth caching separately from trajectories or raw text. It does this through a small context map updated by a Distiller that pulls signals from runs, a Cartographer that turns them into structured changes, and an Evictor that keeps the whole thing under a token cap. That three-module policy is the clearest new piece, and it is presented as programmable rather than fixed in advance.

The reported results show lower iteration counts and cost on long-context reasoning and context-learning tasks, plus some accuracy lifts, and the tests include a production coding agent, which is a plus for relevance. The numbers line up with the claim that a persistent peek helps when the same external context comes up again and again. The abstract gives concrete ranges against ACE, which makes the efficiency angle easy to grasp.

The soft spot is that the experimental controls, task definitions, and variance are not laid out, so it is difficult to judge how much the gains depend on the specific workloads chosen. The stress-test worry about the Evictor dropping critical details under a hard budget also looks reasonable on the surface; if the orientation knowledge grows or shifts faster than the policy can track, the map could lose value on harder cases even if it works on the ones shown.

This is aimed at people who build or tune LLM agents that repeatedly query large fixed contexts, such as codebases or document collections. A reader focused on cutting token spend and iteration loops in deployed settings would find the approach worth trying. It has enough of a targeted idea and quantified outcomes to go to a serious referee rather than a desk reject, mainly to see whether the full methods and ablations hold up the central assumption about reliable extraction and retention.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PEEK, a system for LLM agents operating over long and recurring external contexts such as document corpora and code repositories. It argues that existing methods preserve trajectories, raw context, or task strategies but not reusable 'orientation knowledge' (what the context contains, how it is organized, and which entities/schemas have been useful). PEEK maintains this knowledge in a small constant-sized 'context map' artifact via a programmable cache policy consisting of a Distiller (extracts transferable knowledge from inference-time signals), a Cartographer (translates into structured edits), and a priority-based Evictor (enforces fixed token budget). On long-context reasoning and information aggregation tasks, PEEK reports 6.3-34.0% gains, 93-145 fewer iterations, and 1.7-5.8x lower cost than the ACE prompt-learning baseline; on context learning it reports 6.0-14.0% higher solving rate and 7.8-12.1% higher rubric accuracy at 1.4x lower cost. Results are claimed to generalize across LMs and agent architectures, including a production-grade coding agent.

Significance. If the empirical results hold under rigorous controls, the work provides a practical engineering contribution to efficient long-context agent design by shifting focus from full context retention or trajectory replay to a compact, updatable orientation cache. The programmable policy with explicit Distiller/Cartographer/Evictor decomposition and the inclusion of a production coding agent are positive aspects. The approach could influence memory management in agents for recurring workloads, though its value depends on whether the fixed-size map reliably preserves task-critical details without loss.

major comments (2)

[Evictor module / cache policy] Evictor module (and associated cache policy description): The priority-based Evictor enforces a hard token budget on the context map, but the manuscript provides no explicit mechanism for deriving priorities from inference-time signals alone or for handling cases where useful orientation knowledge exceeds the budget. This directly bears on the central claim that the Distiller + Cartographer + Evictor pipeline reliably extracts, structures, and retains reusable orientation knowledge across repeated invocations without discarding task-critical details.
[Results / experimental evaluation] Results section (performance claims vs. ACE): The reported gains (6.3-34.0% accuracy, iteration and cost reductions) are presented without details on the number of runs, statistical significance tests, variance across seeds, or precise definitions of the long-context reasoning and information aggregation tasks. This undermines evaluation of whether the improvements are robust or specific to the chosen workloads.
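
The first major comment turns on how priorities are derived. Purely as an illustration of what such a rule could look like, the sketch below scores an entry by how often it was used, discounted by how long ago it was last used. This is a generic frequency-recency score with exponential decay chosen for the example, not the paper's priority function; `UsageStats`, `half_life`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageStats:
    hits: int        # number of past queries in which the entry was used
    last_used: int   # index of the most recent query that used it


def priority(stats: UsageStats, current_query: int, half_life: float = 10.0) -> float:
    """Hit count weighted by exponential recency decay.

    An entry left unused for `half_life` queries keeps half of its weight.
    """
    age = max(0, current_query - stats.last_used)
    return stats.hits * 0.5 ** (age / half_life)
```

A rule of this shape depends only on signals observable at inference time, but it would still evict a rarely used entry that happens to be critical for a later query, which is the failure mode the comment asks the authors to address.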

minor comments (2)

[System overview] The three-module pipeline would benefit from a single overview figure or pseudocode early in the paper to clarify data flow between Distiller, Cartographer, and Evictor.
[Related work] Ensure the related-work section explicitly contrasts PEEK with prior context-compression and agent-memory techniques to highlight the novelty of the orientation-cache framing.
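
The data flow that the first minor comment asks to see might look roughly like the following. Everything here is a hypothetical sketch: the `Fact` and `Edit` shapes, the `OBS ` line convention, and the function names are assumptions, and a real Distiller and Cartographer would presumably call a language model rather than apply string rules.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Fact:
    key: str    # what the fact is about, e.g. a file or table name
    text: str   # transferable observation extracted from one run


@dataclass
class Edit:
    op: Literal["upsert", "delete"]
    item_id: str
    text: str = ""


def distill(trajectory: list[str]) -> list[Fact]:
    """Distiller stand-in: keep trajectory lines tagged as observations."""
    facts = []
    for line in trajectory:
        if line.startswith("OBS "):
            key, _, text = line[4:].partition(": ")
            facts.append(Fact(key=key, text=text))
    return facts


def cartograph(facts: list[Fact], current_map: dict[str, str]) -> list[Edit]:
    """Cartographer stand-in: emit edits only where the map would change."""
    return [
        Edit("upsert", f.key, f.text)
        for f in facts
        if current_map.get(f.key) != f.text
    ]


def apply_edits(current_map: dict[str, str], edits: list[Edit]) -> dict[str, str]:
    """Apply edits by stable item ID; budget enforcement would run afterwards."""
    updated = dict(current_map)
    for e in edits:
        if e.op == "upsert":
            updated[e.item_id] = e.text
        else:
            updated.pop(e.item_id, None)
    return updated
```

Keeping the three stages as separate functions mirrors the modular policy described in the summary: each stage has a narrow input and output, so one can be swapped without touching the others.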

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity and rigor.

Referee: [Evictor module / cache policy] Evictor module (and associated cache policy description): The priority-based Evictor enforces a hard token budget on the context map, but the manuscript provides no explicit mechanism for deriving priorities from inference-time signals alone or for handling cases where useful orientation knowledge exceeds the budget. This directly bears on the central claim that the Distiller + Cartographer + Evictor pipeline reliably extracts, structures, and retains reusable orientation knowledge across repeated invocations without discarding task-critical details.

Authors: We thank the referee for this observation. The Distiller extracts signals from inference-time behavior (e.g., access patterns and relevance indicators), which the Cartographer structures; the Evictor then assigns priorities to these structured entries to enforce the token budget. We acknowledge that the current text does not spell out the priority function or the overflow policy in sufficient algorithmic detail. In the revision we will expand Section 3.3 with (i) the explicit priority derivation rule that operates solely on the inference-time signals produced by the Distiller and (ii) the eviction procedure that retains the highest-priority orientation knowledge when the budget is exceeded. This will make the reliability claim easier to evaluate. revision: yes
Referee: [Results / experimental evaluation] Results section (performance claims vs. ACE): The reported gains (6.3-34.0% accuracy, iteration and cost reductions) are presented without details on the number of runs, statistical significance tests, variance across seeds, or precise definitions of the long-context reasoning and information aggregation tasks. This undermines evaluation of whether the improvements are robust or specific to the chosen workloads.

Authors: We agree that these experimental details are necessary for assessing robustness. The reported figures are means over five independent runs with distinct random seeds; we will add error bars, standard deviations, and the results of paired statistical significance tests in the revised results section. We will also insert a new subsection that gives precise task definitions, input formats, and example instances for the long-context reasoning and information aggregation benchmarks. These additions will directly address the concern about workload specificity. revision: yes
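
The paired comparison described in the second response could be computed along these lines. This is a sketch of a standard paired t statistic over per-seed scores using only the Python standard library; the function name and argument names are hypothetical, and no numbers from the paper are used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a: list[float], scores_b: list[float]) -> tuple[float, float, float]:
    """Return (mean difference, std of differences, t statistic).

    Scores are paired by seed, so the test runs on per-seed differences.
    The statistic has len(scores_a) - 1 degrees of freedom.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_mean = mean(diffs)
    d_std = stdev(diffs)  # sample standard deviation (divides by n - 1)
    if d_std == 0:
        raise ValueError("all paired differences are identical")
    t_stat = d_mean / (d_std / sqrt(len(diffs)))
    return d_mean, d_std, t_stat
```

With five seeds the statistic has 4 degrees of freedom, so a two-sided test at the 5% level would compare its absolute value against a critical value of about 2.776.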

Circularity Check

0 steps flagged

No circularity: empirical system evaluated against external baselines

The paper introduces PEEK as an engineering system (Distiller + Cartographer + Evictor) that maintains a fixed-size context map for orientation knowledge in recurring long-context workloads. All reported gains (accuracy, iteration count, cost) are measured empirically against external baselines such as ACE on concrete tasks; no equations, fitted parameters, derivations, or self-citation chains are present in the provided text that would reduce any claim to a tautology or input by construction. The evaluation is therefore self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the domain assumption that orientation knowledge is both extractable from inference signals and sufficiently stable to be cached in a fixed token budget without rapid obsolescence.

axioms (1)

domain assumption: Orientation knowledge about recurring contexts is transferable across invocations and can be distilled into a compact structured map.
This premise underpins the design of the Distiller and Cartographer modules and the claim that a constant-sized artifact suffices.

invented entities (4)

Context map · no independent evidence
purpose: Persistent small artifact providing orientation knowledge inside the agent prompt
New data structure introduced to hold distilled knowledge about the external context.

Distiller module · no independent evidence
purpose: Extracts transferable knowledge from inference-time signals
One of the three programmable cache-policy components.

Cartographer module · no independent evidence
purpose: Translates extracted knowledge into structured map edits
One of the three programmable cache-policy components.

Evictor module · no independent evidence
purpose: Enforces fixed token budget via priority-based eviction
One of the three programmable cache-policy components.

pith-pipeline@v0.9.0 · 5847 in / 1580 out tokens · 39152 ms · 2026-05-20T05:46:00.528737+00:00 · methodology

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · 9 internal anchors

[1]

https://aws.amazon.com/blogs/machine-learning/build-a-read-through-sem antic-cache-with-amazon-opensearch-serverless-and-amazon-bedrock/

Build a read-through semantic cache with Amazon OpenSearch Serverless and Amazon Bedrock. https://aws.amazon.com/blogs/machine-learning/build-a-read-through-sem antic-cache-with-amazon-opensearch-serverless-and-amazon-bedrock/

work page
[2]

Agent Skills Overview

Agent Skills. Agent Skills Overview. https://agentskills.io/home , 2026. Accessed: 2026-05-04

work page 2026
[3]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Claude code.https://docs.anthropic.com/en/docs/claude-code, 2025

Anthropic. Claude code.https://docs.anthropic.com/en/docs/claude-code, 2025

work page 2025
[5]

Agent Skills

Anthropic. Agent Skills. https://platform.claude.com/docs/en/agents-and-tools /agent-skills/overview, 2026. Claude API Docs. Accessed: 2026-05-04

work page 2026
[6]

GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings

Fu Bang. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings. In Liling Tan, Dmitrijs Milajevs, Geeticka Chauhan, Jeremy Gwin- nup, and Elijah Rippeth, editors,Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 212–218, Singapore, December

work page 2023
[7]

Association for Computational Linguistics

work page
[8]

Oolong: Evaluating long context reasoning and aggregation capabilities.arXiv preprint arXiv:2511.02817, 2025

Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, and Matthew R Gorm- ley. Oolong: Evaluating long context reasoning and aggregation capabilities.arXiv preprint arXiv:2511.02817, 2025

work page arXiv 2025
[9]

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, et al. Browsecomp-plus: A more fair and transparent evaluation benchmark of deep-research agent.arXiv preprint arXiv:2508.06600, 2025

work page arXiv 2025
[11]

Agent Skills

Cursor. Agent Skills. https://cursor.com/docs/skills, 2026. Cursor Docs. Accessed: 2026-05-04

work page 2026
[12]

Cl-bench: A benchmark for context learning

Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, et al. Cl-bench: A benchmark for context learning. arXiv preprint arXiv:2602.03587, 2026

work page arXiv 2026
[13]

Bitdecoding: Unlocking tensor cores for long-context llms with low-bit kv cache

Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, and Mao Yang. Bitdecoding: Unlocking tensor cores for long-context llms with low-bit kv cache. In2026 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1–13. IEEE, 2026

work page 2026
[14]

Evicpress: Joint kv-cache compression and eviction for efficient llm serving.arXiv preprint arXiv:2512.14946, 2025

Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, et al. Evicpress: Joint kv-cache compression and eviction for efficient llm serving.arXiv preprint arXiv:2512.14946, 2025

work page arXiv 2025
[15]

Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference.Proceedings of Machine Learning and Systems, 6:325–338, 2024. 10

work page 2024
[16]

Llmsteer: Improving long-context llm inference by steering attention on reused contexts, 2024

Zhuohan Gu, Jiayi Yao, Kuntai Du, and Junchen Jiang. Llmsteer: Improving long-context llm inference by steering attention on reused contexts, 2024

work page 2024
[17]

Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025

work page arXiv 2025
[18]

Infinitehip: Extending language model context up to 3 million tokens on a single gpu, 2025.URL https://arxiv

Heejun Lee, Geon Park, Jaduk Suh, and Sung Ju Hwang. Infinitehip: Extending language model context up to 3 million tokens on a single gpu, 2025.URL https://arxiv. org/abs/2502.08910

work page arXiv 2025
[19]

Infinigen: Efficient generative inference of large language models with dynamic kv cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infinigen: Efficient generative inference of large language models with dynamic kv cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024

work page 2024
[20]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020
[21]

Commvq: Commutative vector quantization for kv cache compression, 2025

Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Peng- sheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, and Chuang Gan. Commvq: Commutative vector quantization for kv cache compression, 2025

work page 2025
[22]

Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving.arXiv preprint arXiv:2411.02820, 2024

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, et al. Droidspeak: Kv cache sharing for cross-llm communication and multi-llm serving.arXiv preprint arXiv:2411.02820, 2024

work page arXiv 2024
[23]

Cachegen: Kv cache compression and streaming for fast large language model serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024

work page 2024
[24]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache.arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Pal, and Siva Reddy

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Ale- jandra Zambrano, Karolina Sta´nczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agen- trewardbench: Evaluating automatic evaluations of web agent trajectories.arXiv preprint arXiv:2504.08942, 2025

work page arXiv 2025
[26]

Hermes agent

Nous Research. Hermes agent. https://github.com/NousResearch/hermes-agent ,

work page
[27]

Accessed: 2026-03-22

work page 2026
[28]

Codex cli

OpenAI. Codex cli. https://github.com/openai/codex, 2025. Lightweight coding agent that runs in your terminal. Accessed: 2026-05-16

work page 2025
[29]

Agent Skills – Codex

OpenAI. Agent Skills – Codex. https://developers.openai.com/codex/skills, 2026. OpenAI Developers. Accessed: 2026-05-04

work page 2026
[30]

Gpt-5.4 nano model | openai api, 2026

OpenAI. Gpt-5.4 nano model | openai api, 2026. https://developers.openai.com/api/ docs/models/gpt-5.4-nano

work page 2026
[31]

GPT-5.5 System Card, April 2026

OpenAI. GPT-5.5 System Card, April 2026

work page 2026
[32]

text-embedding-3-small Model

OpenAI. text-embedding-3-small Model. https://developers.openai.com/api/do cs/models/text-embedding-3-small , 2026. OpenAI API documentation. Accessed: 2026-04-28

work page 2026
[33]

Agentdiagnose: An open toolkit for diagnosing llm agent trajectories

Tianyue Ou, Wanyao Guo, Apurva Gandhi, Graham Neubig, and Xiang Yue. Agentdiagnose: An open toolkit for diagnosing llm agent trajectories. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 207–215, 2025. 11

work page 2025
[34]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes! InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ...

work page 2022
[35]

Metis: fast quality-aware rag systems with configuration adaptation

Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, and Junchen Jiang. Metis: fast quality-aware rag systems with configuration adaptation. InProceedings of the ACM SIGOPS 31st symposium on operating systems principles, pages 606–622, 2025

work page 2025
[36]

Gonzalez

Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, and Joseph E. Gonzalez. vcache: Verified semantic prompt caching, 2026

work page 2026
[37]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.URL https://arxiv.org/abs/2303.11366, 8, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Openclaw: Personal ai assistant

Peter Steinberger and contributors. Openclaw: Personal ai assistant. https://github.com/o penclaw/openclaw, 2025

work page 2025
[40]

Scaling long-horizon llm agent via context-folding.arXiv preprint arXiv:2510.11967, 2025

Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scal- ing long-horizon llm agent via context-folding, 2025.URL https://arxiv. org/abs/2510.11967

work page arXiv 2025
[41]

Dy- namic cheatsheet: Test-time learning with adaptive memory, 2025.URL https://arxiv

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. Dy- namic cheatsheet: Test-time learning with adaptive memory, 2025.URL https://arxiv. org/abs/2504.07952, 2025

work page arXiv 2025
[42]

Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization.arXiv preprint arXiv:2509.13313, 2025

work page arXiv 2025
[43]

Strata: Hierarchical context caching for long context language model serving.arXiv preprint arXiv:2508.18572, 2025

Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, and Christos Kozyrakis. Strata: Hierarchical context caching for long context language model serving.arXiv preprint arXiv:2508.18572, 2025

work page arXiv 2025
[44] Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, and Jianyu Huang. Context parallelism for scalable million-token inference. Proceedings of Machine Learning and Systems, 7, 2025.
[45] Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. KVLink: Accelerating large language models via efficient KV cache reuse. arXiv preprint arXiv:2502.16002, 2025.
[46] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pages 94–109, 2025.
[47] Jinghan Yao, Sam A Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, and Dhabaleswar Panda. Training ultra long context language model with fully pipelined distributed transformer. Proceedings of Machine Learning and Systems, 7, 2025.
[48] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent, 2025. URL https://arxiv.org/abs/2507.02259.
[49] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback. Nature, 639(8055):609–616, 2025.
[50] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025.
[51] Alex L Zhang, Tim Kraska, and Omar Khattab. Recursive language models. arXiv preprint arXiv:2512.24601, 2025.
[52] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models, 2025. URL https://arxiv.org/abs/2510.04618.
[53] Qizheng Zhang, Michael Wornow, Gerry Wan, and Kunle Olukotun. Agentic plan caching: Test-time memory for fast and cost-efficient LLM agents, 2026.
[54] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
[55] Andrew Zhu, Alyssa Hwang, Liam Dugan, and Chris Callison-Burch. FanOutQA: A multi-hop, multi-document question answering benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 18–37, 2024.
[56] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024.