pith. machine review for the scientific record.

arxiv: 2604.10352 · v1 · submitted 2026-04-11 · 💻 cs.AI · cs.OS · cs.SE

Recognition: unknown

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Laurent Bindschaedler, Mofasshara Rafique

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.AI · cs.OS · cs.SE
keywords LLM agents · virtual memory · state management · tool-using agents · agent harness · context window · memory invariants · writeback

The pith

ClawVM turns the LLM agent harness into a virtual memory manager that enforces minimum-fidelity state invariants to prevent loss and ensure durable writeback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current agent harnesses manage the context window as best-effort memory, which produces recurring failures such as lost state after compaction, skipped flushes on reset, and destructive writeback. ClawVM counters this by representing state as typed pages that must satisfy minimum-fidelity requirements, supporting multiple resolutions that stay inside the token budget, and requiring validated writeback at every lifecycle boundary. Because the harness already assembles prompts and observes events, it becomes the natural place to enforce these contracts, making residency and durability deterministic and auditable. A sympathetic reader would care because reliable state persistence is a prerequisite for agents to complete long or multi-step tool-using tasks without manual intervention or hidden failures.

Core claim

ClawVM manages agent state as typed pages with minimum-fidelity invariants and multi-resolution representations under a token budget, together with validated writeback at every lifecycle boundary. Placing the contract in the harness eliminates all policy-controllable faults whenever the minimum-fidelity set fits inside the token budget, as confirmed by an offline oracle, while adding only a median overhead below 50 microseconds per turn across synthetic workloads, real-session traces, and adversarial tests.

What carries the argument

Typed pages carrying minimum-fidelity invariants that guarantee essential state survives compaction and resets through multi-resolution representations and validated writeback enforced by the harness.
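The page abstraction described above can be sketched minimally. The class, enum, and field names below are illustrative assumptions, not ClawVM's actual API; the token estimate is a crude stand-in for a real tokenizer:

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Fidelity(IntEnum):
    """Hypothetical resolution levels, coarsest to finest."""
    SUMMARY = 0
    OUTLINE = 1
    FULL = 2

@dataclass
class Page:
    """A typed page holding one piece of agent state at several resolutions.

    `min_fidelity` encodes the minimum-fidelity invariant: the harness may
    render this page at any admissible resolution, but never below the minimum.
    """
    page_type: str                      # e.g. "plan", "tool_result", "memory_file"
    min_fidelity: Fidelity              # the invariant the harness must enforce
    representations: dict[Fidelity, str] = field(default_factory=dict)

    def tokens(self, level: Fidelity) -> int:
        # Rough stand-in for a tokenizer: ~4 characters per token.
        return max(1, len(self.representations[level]) // 4)

    def admissible_levels(self) -> list[Fidelity]:
        """Resolutions that satisfy the invariant, coarsest first."""
        return sorted(l for l in self.representations if l >= self.min_fidelity)
```

Under this sketch, compaction is constrained to choosing among `admissible_levels()`; dropping a page below `min_fidelity` is simply not an option the selector can express.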

If this is right

  • State is never lost after context compaction once the minimum-fidelity requirement is met.
  • Flushes are no longer bypassed when an agent resets.
  • Writeback becomes verified and non-destructive at every lifecycle boundary.
  • All policy-controllable faults disappear under the stated token-budget condition.
  • Overhead stays below 50 microseconds median per turn in both synthetic and real traces.
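What validated, non-destructive writeback at a lifecycle boundary could look like is sketched below. The journal and the content-hash version check are assumptions for illustration; the paper's actual mechanism may differ:

```python
import hashlib

class WritebackJournal:
    """Minimal sketch of validated, non-destructive writeback. Every flush
    is logged before the durable store is mutated, and a write that would
    clobber content the caller never read is rejected instead of applied."""

    def __init__(self):
        self.store: dict[str, str] = {}   # durable key -> content
        self.journal: list[dict] = []     # append-only flush log

    def _version(self, key: str) -> str:
        """Content hash standing in for a real version tag."""
        return hashlib.sha256(self.store.get(key, "").encode()).hexdigest()

    def flush(self, key: str, content: str, read_version: str) -> bool:
        """Write back only if the caller saw the current version."""
        if read_version != self._version(key):
            self.journal.append({"key": key, "event": "rejected-stale-write"})
            return False                   # destructive writeback prevented
        self.journal.append({"key": key, "event": "flush", "bytes": len(content)})
        self.store[key] = content
        return True

    def drain(self, event: str, pending: dict[str, tuple[str, str]]) -> None:
        """Lifecycle hook: compaction or reset must drain pending flushes
        before state leaves the context, so flushes cannot be bypassed."""
        for key, (content, read_version) in pending.items():
            if not self.flush(key, content, read_version):
                raise RuntimeError(f"stale writeback blocked during {event}")
```

A reset handler would call `drain("reset", pending)` before clearing the context, making flush-on-reset an enforced step rather than a best-effort prompt convention.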

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent frameworks could adopt harness-level memory contracts instead of relying on ad-hoc prompt engineering for persistence.
  • The same minimum-fidelity approach might extend to other context-limited systems that need auditable state across turns.
  • Replaying the 12 real-session traces with and without the invariants would give a direct measure of fault reduction in practice.
  • Developers could experiment with adaptive fidelity levels that raise or lower the minimum based on task criticality.

Load-bearing premise

The minimum-fidelity set of state always fits inside the token budget and the harness can enforce the invariants and writeback without changing the agent's tool-using behavior or performance.
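The feasibility half of this premise can be stated as a simple predicate, in the spirit of the paper's offline oracle. The page encoding here is an assumption for illustration, not the paper's data structure:

```python
def min_fidelity_tokens(page: dict) -> int:
    """Token cost of the coarsest representation that still satisfies the
    page's minimum-fidelity invariant. Pages are assumed to be dicts of
    the form {"min": level, "reps": {level: token_cost}}."""
    coarsest = min(level for level in page["reps"] if level >= page["min"])
    return page["reps"][coarsest]

def minimum_fidelity_fits(pages: list[dict], budget: int) -> bool:
    """The condition under which the elimination guarantee is claimed:
    the whole minimum-fidelity set fits within the token budget."""
    return sum(min_fidelity_tokens(p) for p in pages) <= budget
```

When this predicate is false for some turn, the guarantee as stated simply does not apply, which is exactly the scope question the referee raises below.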

What would settle it

An observed state-loss fault or durability failure in any workload where an offline oracle has already confirmed that the minimum-fidelity set fits within the token budget would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.10352 by Laurent Bindschaedler, Mofasshara Rafique.

Figure 1: Architecture of a stateful tool-using agent with the ClawVM layer (shaded). The agent harness manages sessions, assembles prompts, mediates tools, and emits lifecycle events (compaction, pruning, flush, reset). ClawVM interposes at the harness level: the page table and representation selector feed prompt assembly, the fault observer instruments paging and lifecycle behavior, and the writeback journal enfor…
read the original abstract

Stateful tool-using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present ClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, ClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy-engine overhead per turn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawVM, a harness-managed virtual memory layer for stateful tool-using LLM agents. It models agent state as typed pages subject to minimum-fidelity invariants, supports multi-resolution representations that respect a token budget, and enforces validated writeback at lifecycle boundaries. The central claim is that this design eliminates all policy-controllable faults (lost state after compaction, bypassed flushes, destructive writeback) whenever the minimum-fidelity set fits within the token budget, as verified by an offline oracle replaying traces; the system adds median overhead below 50 microseconds per turn. The evaluation covers synthetic workloads, 12 real-session traces, and adversarial stress tests.

Significance. If the claims hold, ClawVM would provide a practical and auditable way to make state residency and durability deterministic inside existing agent harnesses, addressing a recurring source of unreliability in tool-using LLM agents. The decision to locate enforcement at the harness (which already assembles prompts and observes lifecycle events) is a pragmatic strength. The breadth of workloads (synthetic, 12 real traces, adversarial) is a positive empirical feature, though the absence of detailed metrics, error bars, or methodology in the abstract limits assessment of how strongly the data support the elimination claim.

major comments (2)
  1. [Abstract] The central claim that ClawVM 'eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle' rests on replaying fixed traces. Because multi-resolution representations can alter state encoding and therefore agent tool calls and decisions, an offline oracle on original traces compares non-equivalent executions and may miss semantic or interaction faults introduced by the new representations. This is load-bearing for the fault-elimination result and requires either live adaptive evaluation or a clear argument that behavior remains equivalent.
  2. [Abstract] The paper states that the minimum-fidelity set 'always fits' under the reported conditions, yet provides no description of how this set is computed, what happens when it does not fit, or the token-budget allocation policy. Without these details the 'whenever' qualifier cannot be evaluated and the practical scope of the guarantee remains unclear.
minor comments (2)
  1. [Abstract] Quantitative results (exact fault counts per workload, overhead distributions, token-budget utilization) are summarized only at the level of 'eliminates all' and 'median <50 microseconds'; adding a table or figure reference with these numbers would improve verifiability.
  2. [Abstract] The abstract uses the term 'policy-controllable faults' without a concise definition or enumeration of the fault classes considered; a short list or reference to a later section would clarify scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us clarify the scope of our claims and the strength of the supporting evidence. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The central claim that ClawVM 'eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle' rests on replaying fixed traces. Because multi-resolution representations can alter state encoding and therefore agent tool calls and decisions, an offline oracle on original traces compares non-equivalent executions and may miss semantic or interaction faults introduced by the new representations. This is load-bearing for the fault-elimination result and requires either live adaptive evaluation or a clear argument that behavior remains equivalent.

    Authors: We agree that replaying fixed traces with an offline oracle does not by itself rule out behavioral divergence caused by altered representations. ClawVM's minimum-fidelity invariants are intended to preserve the semantic content required for correct tool calls and decisions, but we recognize that an explicit argument for equivalence is needed to support the fault-elimination claim. In the revised manuscript we will add a subsection that formally argues behavioral equivalence from the invariants and reports empirical checks (no divergence in agent outputs was observed across the 12 traces). We view this as sufficient to address the concern for the workloads evaluated; if the editor prefers, we can also add limited live adaptive runs. revision: partial

  2. Referee: [Abstract] The paper states that the minimum-fidelity set 'always fits' under the reported conditions, yet provides no description of how this set is computed, what happens when it does not fit, or the token-budget allocation policy. Without these details the 'whenever' qualifier cannot be evaluated and the practical scope of the guarantee remains unclear.

    Authors: The computation of the minimum-fidelity set, the token-budget allocation policy, and the failure mode when the set does not fit are described in Sections 3.2 and 4.1. The set is obtained by choosing, for each typed page, the coarsest representation that still satisfies the type-specific invariants; the policy first reserves sufficient tokens for all minimum-fidelity pages and then distributes any remainder to higher-resolution variants by recency and access frequency. When the minimum-fidelity set exceeds the budget the harness raises a policy violation rather than allowing lossy state. We accept that these mechanisms are not summarized in the abstract, weakening the clarity of the 'whenever' qualifier. We will revise the abstract to include a concise description of the computation, allocation policy, and violation behavior. revision: yes
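The allocation policy as the authors describe it here can be sketched directly. Identifiers and the page encoding below are illustrative, not drawn from Sections 3.2 and 4.1 themselves:

```python
def allocate(pages: list[dict], budget: int) -> dict[str, int]:
    """Sketch of the described policy. Each page is assumed to be a dict:
      {"id": str, "min": level, "reps": {level: token_cost}, "score": float}
    where "score" stands in for the recency/access-frequency ranking."""
    # Step 1: reserve the coarsest representation that satisfies each
    # page's minimum-fidelity invariant.
    chosen = {p["id"]: min(l for l in p["reps"] if l >= p["min"]) for p in pages}
    used = sum(p["reps"][chosen[p["id"]]] for p in pages)
    if used > budget:
        # Failure mode: raise a policy violation rather than go lossy.
        raise RuntimeError("policy violation: minimum-fidelity set exceeds budget")
    # Step 2: distribute any remaining tokens to higher-resolution variants,
    # hottest pages (by recency/frequency score) first.
    for p in sorted(pages, key=lambda p: p["score"], reverse=True):
        for lvl in sorted(l for l in p["reps"] if l > chosen[p["id"]]):
            extra = p["reps"][lvl] - p["reps"][chosen[p["id"]]]
            if used + extra <= budget:
                used += extra
                chosen[p["id"]] = lvl
    return chosen
```

The key property of the sketch mirrors the rebuttal: the minimum-fidelity reservation is unconditional, promotions are opportunistic, and infeasibility surfaces as an explicit violation instead of silent state loss.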

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent oracle validation

full rationale

The paper's core result—that ClawVM eliminates all policy-controllable faults when the minimum-fidelity set fits the token budget—is presented as an empirical finding confirmed by an offline oracle across synthetic, real-session, and adversarial workloads. No equations, derivations, or first-principles steps are shown that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The design invariants and multi-resolution representations are stated as engineering choices enforced by the harness, with evaluation serving as external verification rather than tautological confirmation. This is the normal non-circular case for a systems paper whose claims are falsifiable via the described oracle.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The abstract introduces new concepts and relies on assumptions about harness capabilities and token-budget feasibility without providing independent evidence or derivations.

free parameters (1)
  • token budget
    Constraint under which multi-resolution representations must operate; no specific fitted values or selection process described.
axioms (1)
  • domain assumption: The harness is the natural enforcement point for residency and durability invariants because it assembles prompts and observes lifecycle events.
    Invoked to justify placing the contract in the harness.
invented entities (2)
  • typed pages (no independent evidence)
    purpose: Represent agent state with minimum-fidelity invariants for management.
    New abstraction introduced for context window state.
  • minimum-fidelity invariants (no independent evidence)
    purpose: Guarantee preservation of essential state under compaction or reset.
    Core mechanism for deterministic behavior.

pith-pipeline@v0.9.0 · 5462 in / 1392 out tokens · 69412 ms · 2026-05-10T15:18:19.285900+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 19 canonical work pages · 9 internal anchors
