pith. machine review for the scientific record.

arxiv: 2605.09998 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Chengshuai Shi, Chi Jin, Joel Zhang, Kiran Vodrahalli, Ruirong Feng, Seth Karten, Tersoo Upaa Jr, Wenzhe Li

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords continual learning · self-improving agents · embodied AI · foundation models · online adaptation · harness design · long-horizon decision making

The pith

A reset-free harness lets foundation agents refine their own prompts, skills, and memory online from raw interfaces, closing most of the gap to expert performance in long-horizon games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Continual Harness as a method for embodied foundation agents to improve themselves without human oversight or environment resets. The agent starts with only a minimal interface and alternates between acting in the world and updating its prompt, sub-agents, skills, and memory using data from any past trajectories. Experiments on Pokemon Red and Emerald show this cuts button-press costs compared with a basic setup and recovers most of the advantage held by a hand-engineered expert harness. A separate loop uses the agent's rollouts to label data that updates an open-source model, producing ongoing milestone progress in a single continuous run. A sympathetic reader would care because the approach removes the need for episode resets that most prompt-optimization methods require, pointing toward agents that can sustain adaptation in partial-observability settings.

Core claim

Continual Harness is a reset-free self-improving harness for embodied agents that formalizes and automates online adaptation: starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. On Pokemon Red and Emerald across frontier models, this substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains. It further enables an online process-reward co-learning loop that drives sustained in-game milestone progress without resetting the environment.

What carries the argument

Continual Harness: the online alternation between acting in the environment and self-refining the agent's prompt, sub-agents, skills, and memory using past trajectory data within a single continuous run.
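The alternation described above can be made concrete with a small control-loop sketch. All names here (`HarnessState`, `act`, `refine`) are illustrative stand-ins, not the paper's actual API; the acting and refining steps are stubbed to show only the loop structure.

```python
# Hypothetical sketch of the act/refine alternation within one continuous run.
# The stubs stand in for LLM-driven acting and self-refinement.
from dataclasses import dataclass, field

@dataclass
class HarnessState:
    prompt: str = "Play the game using raw button presses."
    sub_agents: list = field(default_factory=list)
    skills: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)

def act(state: HarnessState, env_obs, n_steps: int):
    """Collect one trajectory segment under the current harness (stubbed)."""
    return [{"obs": env_obs, "action": "A", "cost": 1} for _ in range(n_steps)]

def refine(state: HarnessState, trajectories) -> HarnessState:
    """Refine prompt/skills/memory from any past trajectory data (stubbed)."""
    state.memory.append(f"reviewed {len(trajectories)} segments")
    state.skills.setdefault("navigate", "hold direction toward goal")
    return state

def continual_harness(env_obs, phases: int = 3, steps_per_phase: int = 4):
    """Single continuous run: alternate acting and refining, never resetting."""
    state, history = HarnessState(), []
    for _ in range(phases):
        history.append(act(state, env_obs, steps_per_phase))  # act online
        state = refine(state, history)                        # refine mid-run
    return state, history
```

The key property the sketch preserves is that `refine` sees the full `history` and runs between acting phases of the same run, rather than between reset episodes.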

If this is right

  • On frontier models for Pokemon Red and Emerald, Continual Harness starting from scratch reduces button-press cost relative to the minimalist baseline.
  • It recovers a majority of the performance gap to a hand-engineered expert harness despite using the same raw interface.
  • Gains are capability-dependent, appearing across different foundation models.
  • The added online process-reward co-learning loop produces sustained in-game milestone progress on Pokemon Red without environment resets between training iterations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same online refinement loop could support real-world robotics tasks where resets are expensive or unsafe.
  • Self-refinement from raw trajectories may allow agents to discover strategies that human harness designers did not anticipate.
  • Combining the harness with periodic model updates creates a pathway for continuous capability growth without separate training phases.
  • The approach may extend to other long-horizon partial-observability domains such as navigation or multi-step planning.

Load-bearing premise

The foundation model can reliably and productively refine its own prompt, sub-agents, skills, and memory from past trajectory data in an online setting without performance degradation or looping into suboptimal strategies.

What would settle it

A single long unreset run on one of the tested games in which the agent's button-press efficiency stops improving or begins to decline after initial gains, or in which the self-refinement loop requires external intervention to continue.
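One way to operationalize this falsifier is a windowed check on the button-press cost series: flag the run once the latest window is no better than the one before it. The window size and tolerance here are illustrative; the paper specifies no such test.

```python
def improvement_stalled(costs, window=5, tol=0.0):
    """Return True if mean button-press cost in the latest window is no
    better than in the preceding window (illustrative stall detector)."""
    if len(costs) < 2 * window:
        return False  # not enough data to compare two full windows
    prev = sum(costs[-2 * window:-window]) / window
    recent = sum(costs[-window:]) / window
    return recent >= prev - tol
```

A run that keeps `improvement_stalled` False over a long unreset horizon is evidence for sustained adaptation; a run where it flips True after initial gains is the settling observation described above.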

read the original abstract

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
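The abstract's closing loop — student rollouts through the refining harness, relabeled by a frontier teacher, fed back as model updates with no environment reset — can be sketched as a single iteration step. Every name below is hypothetical; the transition and update are toy stand-ins for the real environment and training step.

```python
def co_learning_step(student_policy, teacher_relabel, update_model, env_state,
                     rollout_len=3):
    """One iteration of an online process-reward co-learning loop (sketch):
    the student rolls out, a teacher relabels each step with a process
    reward, and the labels update the student. The environment state
    carries over between iterations instead of being reset."""
    rollout = []
    for _ in range(rollout_len):
        action = student_policy(env_state)
        rollout.append((env_state, action))
        env_state = env_state + 1            # toy transition; no reset
    labeled = [(s, a, teacher_relabel(s, a)) for s, a in rollout]
    return update_model(student_policy, labeled), env_state
```

Chaining calls and passing the returned `env_state` back in is what makes the loop reset-free: training iterations consume a single continuing trajectory.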

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Continual Harness, a reset-free framework that enables foundation models to alternate between acting in long-horizon partial-observability environments and autonomously refining their own prompts, sub-agents, skills, and memory using only past trajectory data. It reports results on Pokemon Red and Emerald showing that the approach, starting from a minimal interface with no curated knowledge or tools, reduces button-press costs relative to a minimalist baseline and recovers a majority of the performance gap to a hand-engineered expert harness, with gains scaling by model capability; it further demonstrates an online process-reward co-learning loop that sustains milestone progress without environment resets.

Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for demonstrating practical online self-improvement in embodied agents without human intervention or resets, extending observed emergent behaviors from human-in-the-loop setups into a fully automated harness. The capability-dependent gains and the closed-loop co-learning component provide concrete evidence of productive adaptation from raw interfaces, which could influence design of autonomous agents in similar domains.

major comments (2)
  1. [Abstract] Abstract: the central claim of substantial button-press cost reduction and recovery of a majority of the gap to the expert harness is presented without any quantitative metrics, error bars, number of runs, or statistical tests, which is load-bearing for assessing whether the gains reflect genuine adaptation rather than selected trajectories or post-hoc choices.
  2. [Continual Harness framework] The Continual Harness framework description: no details are provided on revision triggers, validation of proposed refinements to prompts/skills/memory, or recovery mechanisms from error compounding in noisy trajectory data, which directly bears on the weakest assumption that online self-refinement remains productive without entering unrecoverable suboptimal loops in partial-observability settings like Pokemon.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit comparison of the exact button-press cost metric used versus prior GPP human-in-the-loop results to clarify continuity with the motivating experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below and will incorporate revisions to strengthen the presentation of our results and framework details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of substantial button-press cost reduction and recovery of a majority of the gap to the expert harness is presented without any quantitative metrics, error bars, number of runs, or statistical tests, which is load-bearing for assessing whether the gains reflect genuine adaptation rather than selected trajectories or post-hoc choices.

    Authors: We agree that including quantitative support in the abstract would improve transparency for the central empirical claims. The experiments section of the manuscript reports specific metrics (e.g., button-press cost reductions and gap recovery percentages across models), number of runs, and variability measures. In the revised manuscript we will add concise quantitative statements to the abstract, such as approximate percentage reductions and the fraction of the expert gap recovered, while retaining the high-level summary style. This directly addresses the concern about assessing genuine adaptation. revision: yes

  2. Referee: [Continual Harness framework] The Continual Harness framework description: no details are provided on revision triggers, validation of proposed refinements to prompts/skills/memory, or recovery mechanisms from error compounding in noisy trajectory data, which directly bears on the weakest assumption that online self-refinement remains productive without entering unrecoverable suboptimal loops in partial-observability settings like Pokemon.

    Authors: The referee correctly notes that the current framework description is high-level and omits explicit operational details on revision triggers, validation of refinements, and recovery from error compounding. These elements are implemented in our experiments but not fully elaborated in the text. We will expand the Continual Harness section in the revision to specify: (1) revision triggers (e.g., after milestone detection or performance plateau thresholds derived from trajectory statistics), (2) validation procedures (e.g., simulated rollout checks or consistency scoring against recent successful trajectories before committing changes to prompts/skills/memory), and (3) recovery mechanisms (e.g., fallback to prior stable configurations or periodic lightweight resets of sub-agents when trajectory noise indicators exceed thresholds). This addition will clarify how the system avoids unrecoverable loops in partial-observability environments like Pokemon while preserving the reset-free online property. revision: yes
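Read literally, the three mechanisms the rebuttal promises (plateau-based triggers, rollout validation, fallback to a stable configuration) could look like the following sketch. Thresholds and names are illustrative only; the paper publishes no such code.

```python
def should_revise(costs, plateau_window=4, eps=1e-9):
    """Trigger: propose a refinement once per-step cost has plateaued."""
    recent = costs[-plateau_window:]
    return len(recent) == plateau_window and max(recent) - min(recent) < eps

def validate(candidate_score, baseline_score, margin=0.0):
    """Validation: accept a refinement only if rollout checks do not regress."""
    return candidate_score >= baseline_score - margin

def commit_or_fallback(config, candidate, candidate_score, baseline_score,
                       stable_history):
    """Recovery: keep the last stable configuration on failed validation."""
    stable_history.append(config)
    if validate(candidate_score, baseline_score):
        return candidate
    return stable_history[-1]  # fall back without resetting the environment
```

Whether the actual implementation resembles this is exactly what the referee's second major comment asks the revision to document.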

Circularity Check

0 steps flagged

No significant circularity in claimed derivation or results

full rationale

The paper reports empirical performance gains for Continual Harness on Pokemon Red/Emerald by direct measurement of button-press cost and milestone progress against two external baselines (minimalist raw interface and hand-engineered expert harness). The method is described as automating observed self-refinement behavior from prior GPP runs, but the evaluation chain relies on independent environment interactions and comparisons rather than any self-defined fitted quantities, renamed patterns, or load-bearing self-citations that reduce the central claim to its own inputs by construction. No equations or formal derivations are present that would trigger self-definitional or fitted-input patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that frontier models possess sufficient long-context reasoning to perform productive self-refinement from raw trajectories; no explicit free parameters or invented physical entities are introduced beyond the harness framework itself.

axioms (1)
  • domain assumption Frontier models can use long-context memory to surface and act on emergent self-improvement signals from past trajectories.
    Invoked when describing the transition from human-in-the-loop GPP to fully automated Continual Harness.
invented entities (1)
  • Continual Harness no independent evidence
    purpose: Reset-free self-improving harness that automates prompt, sub-agent, skill, and memory refinement online.
    New framework introduced to formalize observed GPP behaviors; no independent falsifiable evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5629 in / 1485 out tokens · 56095 ms · 2026-05-12T03:46:11.665246+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 9 internal anchors

  1. [1]

    L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025. 1, 3.1, 5.1

  2. [2]

    Anthropic. Claude Code. https://docs.anthropic.com/en/docs/claude-code, 2025. 1, 5.1

  3. [3]

    Anthropic. Claude plays Pokémon. https://www.twitch.tv/claudeplayspokemon, 2025. 5.2

  4. [4]

    A. Gupta, J. Yu, T. Z. Zhao, V. Kumar, A. Rovinsky, K. Xu, T. Devlin, and S. Levine. Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6664–6671. IEEE, 2021. 5.3

  5. [5]

    S. Karten, W. Li, Z. Ding, S. Kleiner, Y. Bai, and C. Jin. LLM economist: Large population models and mechanism design in multi-agent generative simulacra. arXiv preprint arXiv:2507.15815, 2025. 5.3

  6. [6]

    S. Karten, A. L. Nguyen, and C. Jin. Pokéchamp: an expert-level minimax language agent. arXiv preprint arXiv:2503.04094, 2025. 5.2

  7. [7]

    S. Karten, J. Grigsby, T. Upaa Jr, J. Bae, S. Hong, H. Jeong, J. Jung, K. Kerdthaisong, G. Kim, H. Kim, et al. The pokeagent challenge: Competitive and long-context learning at scale. arXiv preprint arXiv:2603.15563, 2026. 1, 2.2, 4.1, 4.1, 5.1, 5.2, A

  8. [8]

    S. Karten, A. L. Nguyen, S. Milani, and C. Jin. Small experts, big students: Distilling long-horizon RL policies into LLM agents via imitation learning. 2026. 4.5, D.1, D.4

  9. [9]

    keepingiticy. Pokémon Emerald any% glitchless speedrun (mGBA). Speedrun.com, 2024. URL https://www.speedrun.com/pkmnemerald/runs/yvpvw74y.

  10. [10]

    Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026. 5.1

  11. [11]

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023. 5.3

  12. [12]

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023. 5.1

  13. [13]

    Nous Research. Hermes agent. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-03-22. 5.1

  14. [14]

    K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024. 3.1, 5.1

  15. [15]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011. 4.5, D.1, D.4

  16. [16]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5.3, D.1

  17. [17]

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023. 5.1

  18. [18]

    K. Song, A. Moeini, P. Wang, L. Gong, R. Chandra, S. Zhang, and Y. Qi. Reward is enough: LLMs are in-context reinforcement learners. arXiv preprint arXiv:2506.06303, 2025. 5.3

  19. [19]

    P. Steinberger. OpenClaw: An open-source autonomous AI agent. https://github.com/psteinb/openclaw, 2025. Originally released as Clawdbot, November 2025. 1, 5.1

  20. [20]

    G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. 5.2

  21. [21]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024. 1, 5.1

  22. [22]

    Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang. OpenClaw-RL: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026. 5.3, D.1

  23. [23]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. 5.3

  24. [24]

    E. Zelikman, Y. Wu, J. Mu, and N. Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022. 5.3

  25. [25]

    A. L. Zhang, T. Kraska, and O. Khattab. Recursive language models. arXiv preprint arXiv:2512.24601, 2025. 5.3

  26.–47. [Figure residue, not references: the remaining extracted entries are flowchart node text and plot labels from the paper's appendix. Recoverable captions: Figure 12, "The four Yellow Legacy battle-agent checkpoints marked a1/b1/c1/d1 on the complexity plot in Figure 4"; Figure 13, "The remaining ten Yellow Legacy battle-agent checkpoints, grouped by natural aspect."]