Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3
The pith
A reset-free harness lets foundation agents refine their own prompts, skills, and memory online from raw interfaces, recovering a majority of the gap to expert performance in long-horizon games.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continual Harness is a reset-free self-improving harness for embodied agents that formalizes and automates online adaptation: starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. On Pokemon Red and Emerald across frontier models, this substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains; it further enables an online process-reward co-learning loop that drives sustained in-game milestone progress without resetting the environment.
What carries the argument
Continual Harness: the online alternation between acting in the environment and self-refining the agent's prompt, sub-agents, skills, and memory using past trajectory data within a single continuous run.
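A minimal sketch of this loop, assuming a generic environment and model interface; the `Harness` fields, the `refine_every` cadence (standing in for the paper's every-F-steps Refiner trigger), and all method names here are illustrative, not the paper's API.

```python
# Illustrative sketch of the act/refine alternation; not the paper's code.
from dataclasses import dataclass, field

@dataclass
class Harness:
    prompt: str = "You are an embodied agent."
    sub_agents: dict = field(default_factory=dict)  # name -> instructions
    skills: dict = field(default_factory=dict)      # name -> tool definition
    memory: list = field(default_factory=list)      # persistent notes

def run_continual(env, model, harness, refine_every=500):
    """One continuous, never-reset run: act, then periodically self-refine."""
    trajectory = []               # full history kept across the whole run
    obs = env.observe()           # raw interface only: screen + button presses
    step = 0
    while not env.done():
        action = model.act(obs, harness)   # act under the current harness
        obs, reward = env.step(action)
        trajectory.append((obs, action, reward))
        step += 1
        if step % refine_every == 0:
            # The model edits its own prompt, sub-agents, skills, and memory
            # from past trajectory data; no episode reset is required.
            harness = model.refine(harness, trajectory)
    return harness, trajectory
```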
If this is right
- On frontier models for Pokemon Red and Emerald, Continual Harness starting from scratch reduces button-press cost relative to the minimalist baseline.
- It recovers a majority of the performance gap to a hand-engineered expert harness despite using the same raw interface.
- Gains are capability-dependent, appearing across different foundation models.
- The added online process-reward co-learning loop produces sustained in-game milestone progress on Pokemon Red without environment resets between training iterations.
Where Pith is reading between the lines
- The same online refinement loop could support real-world robotics tasks where resets are expensive or unsafe.
- Self-refinement from raw trajectories may allow agents to discover strategies that human harness designers did not anticipate.
- Combining the harness with periodic model updates creates a pathway for continuous capability growth without separate training phases.
- The approach may extend to other long-horizon partial-observability domains such as navigation or multi-step planning.
Load-bearing premise
The foundation model can reliably and productively refine its own prompt, sub-agents, skills, and memory from past trajectory data in an online setting without performance degradation or looping into suboptimal strategies.
What would settle it
A single long unreset run on one of the tested games in which the agent's button-press efficiency stops improving or begins to decline after initial gains, or in which the self-refinement loop requires external intervention to continue.
original abstract
Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
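A hedged sketch of the final loop the abstract describes, under assumed interfaces: `student`, `teacher.score`, `update_policy`, and `refine` are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of the online process-reward co-learning loop;
# all interfaces here are assumptions, not the paper's implementation.
def co_learning_loop(env, student, teacher, harness, steps_per_iter=1000):
    obs = env.observe()
    while not env.done():                        # environment is never reset
        rollout = []
        for _ in range(steps_per_iter):
            action = student.act(obs, harness)   # open-source student acts
            obs, _ = env.step(action)
            rollout.append((obs, action))
        # Frontier teacher relabels each step with a process reward.
        rewards = [teacher.score(o, a) for o, a in rollout]
        student.update_policy(rollout, rewards)  # e.g., one training update
        harness = student.refine(harness, rollout)  # harness keeps adapting
    return student, harness
```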
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Continual Harness, a reset-free framework that enables foundation models to alternate between acting in long-horizon partial-observability environments and autonomously refining their own prompts, sub-agents, skills, and memory using only past trajectory data. It reports results on Pokemon Red and Emerald showing that the approach, starting from a minimal interface with no curated knowledge or tools, reduces button-press costs relative to a minimalist baseline and recovers a majority of the performance gap to a hand-engineered expert harness, with gains scaling by model capability; it further demonstrates an online process-reward co-learning loop that sustains milestone progress without environment resets.
Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for demonstrating practical online self-improvement in embodied agents without human intervention or resets, extending observed emergent behaviors from human-in-the-loop setups into a fully automated harness. The capability-dependent gains and the closed-loop co-learning component provide concrete evidence of productive adaptation from raw interfaces, which could influence design of autonomous agents in similar domains.
major comments (2)
- [Abstract] The central claim of substantial button-press cost reduction and recovery of a majority of the gap to the expert harness is presented without quantitative metrics, error bars, number of runs, or statistical tests; this is load-bearing for assessing whether the gains reflect genuine adaptation rather than selected trajectories or post-hoc choices.
- [Continual Harness framework] No details are provided on revision triggers, validation of proposed refinements to prompts, skills, or memory, or recovery mechanisms for error compounding in noisy trajectory data; this bears directly on the weakest assumption, that online self-refinement remains productive without entering unrecoverable suboptimal loops in partial-observability settings like Pokemon.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from explicit comparison of the exact button-press cost metric used versus prior GPP human-in-the-loop results to clarify continuity with the motivating experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below and will incorporate revisions to strengthen the presentation of our results and framework details.
point-by-point responses
- Referee: [Abstract] The central claim of substantial button-press cost reduction and recovery of a majority of the gap to the expert harness is presented without quantitative metrics, error bars, number of runs, or statistical tests; this is load-bearing for assessing whether the gains reflect genuine adaptation rather than selected trajectories or post-hoc choices.
Authors: We agree that including quantitative support in the abstract would improve transparency for the central empirical claims. The experiments section of the manuscript reports specific metrics (e.g., button-press cost reductions and gap recovery percentages across models), number of runs, and variability measures. In the revised manuscript we will add concise quantitative statements to the abstract, such as approximate percentage reductions and the fraction of the expert gap recovered, while retaining the high-level summary style. This directly addresses the concern about assessing genuine adaptation. revision: yes
- Referee: [Continual Harness framework] No details are provided on revision triggers, validation of proposed refinements to prompts, skills, or memory, or recovery mechanisms for error compounding in noisy trajectory data; this bears directly on the weakest assumption, that online self-refinement remains productive without entering unrecoverable suboptimal loops in partial-observability settings like Pokemon.
Authors: The referee correctly notes that the current framework description is high-level and omits explicit operational details on revision triggers, validation of refinements, and recovery from error compounding. These elements are implemented in our experiments but not fully elaborated in the text. We will expand the Continual Harness section in the revision to specify: (1) revision triggers (e.g., after milestone detection or performance plateau thresholds derived from trajectory statistics), (2) validation procedures (e.g., simulated rollout checks or consistency scoring against recent successful trajectories before committing changes to prompts/skills/memory), and (3) recovery mechanisms (e.g., fallback to prior stable configurations or periodic lightweight resets of sub-agents when trajectory noise indicators exceed thresholds). This addition will clarify how the system avoids unrecoverable loops in partial-observability environments like Pokemon while preserving the reset-free online property. revision: yes
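One way the three safeguards the rebuttal enumerates could compose, sketched under assumed interfaces; the trigger conditions, `consistency_score`, and every threshold below are illustrative guesses, not values from the paper.

```python
# Illustrative composition of the rebuttal's three safeguards; every
# interface and threshold is an assumption, not taken from the paper.
import copy

def maybe_refine(model, harness, stable_harness, trajectory,
                 window=500, noise_threshold=0.3):
    recent = trajectory[-window:]           # (obs, action, reward) triples
    progress = sum(r for _, _, r in recent)

    # (1) Revision trigger: refine only on a milestone or a progress plateau.
    milestone = any(r >= 1.0 for _, _, r in recent)   # hypothetical flag
    if progress > 0 and not milestone:
        return harness, stable_harness                # keep acting as-is

    # (3) Recovery: if trajectory noise is high, fall back to the last
    # stable configuration instead of committing a risky edit.
    zero_frac = sum(1 for _, _, r in recent if r == 0) / max(len(recent), 1)
    if zero_frac > noise_threshold:
        return stable_harness, stable_harness

    # (2) Validation: score the proposed edit against recent trajectories
    # before committing it to the live harness.
    proposal = model.refine(copy.deepcopy(harness), trajectory)
    if model.consistency_score(proposal, recent) >= model.consistency_score(harness, recent):
        return proposal, harness       # old harness becomes the new fallback
    return harness, stable_harness     # reject the edit, keep current state
```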
Circularity Check
No significant circularity in claimed derivation or results
full rationale
The paper reports empirical performance gains for Continual Harness on Pokemon Red/Emerald by direct measurement of button-press cost and milestone progress against two external baselines (minimalist raw interface and hand-engineered expert harness). The method is described as automating observed self-refinement behavior from prior GPP runs, but the evaluation chain relies on independent environment interactions and comparisons rather than any self-defined fitted quantities, renamed patterns, or load-bearing self-citations that reduce the central claim to its own inputs by construction. No equations or formal derivations are present that would trigger self-definitional or fitted-input patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Frontier models can use long-context memory to surface and act on emergent self-improvement signals from past trajectories.
invented entities (1)
- Continual Harness: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · match: unclear · "the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data... every F steps, a Refiner reads the recent trajectory for failure signatures and runs four passes over the harness applying CRUD edits"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · match: unclear · "Continual Harness... reset-free self-improving harness... online in-context learning over the harness state"
Reference graph
Works this paper leans on
- [1] L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025. 1, 3.1, 5.1
- [2] Anthropic. Claude Code. https://docs.anthropic.com/en/docs/claude-code, 2025. 1, 5.1
- [3] Anthropic. Claude plays Pokémon. https://www.twitch.tv/claudeplayspokemon, 2025. 5.2
- [4] A. Gupta, J. Yu, T. Z. Zhao, V. Kumar, A. Rovinsky, K. Xu, T. Devlin, and S. Levine. Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6664–6671. IEEE, 2021. 5.3
- [5] S. Karten, W. Li, Z. Ding, S. Kleiner, Y. Bai, and C. Jin. LLM economist: Large population models and mechanism design in multi-agent generative simulacra. arXiv preprint arXiv:2507.15815, 2025. 5.3
- [6]
- [7] S. Karten, J. Grigsby, T. Upaa Jr, J. Bae, S. Hong, H. Jeong, J. Jung, K. Kerdthaisong, G. Kim, H. Kim, et al. The PokeAgent challenge: Competitive and long-context learning at scale. arXiv preprint arXiv:2603.15563, 2026. 1, 2.2, 4.1, 4.1, 5.1, 5.2, A
- [8]
- [9] keepingiticy. Pokémon Emerald any% glitchless speedrun (mGBA). Speedrun.com, 2024. URL https://www.speedrun.com/pkmnemerald/runs/yvpvw74y.
- [10] Y. Lee, R. Nair, Q. Zhang, K. Lee, O. Khattab, and C. Finn. Meta-harness: End-to-end optimization of model harnesses. arXiv preprint arXiv:2603.28052, 2026. 5.1
- [11] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023. 5.3
- [12]
- [13] Nous Research. Hermes agent. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-03-22. 5.1
- [14] K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024. 3.1, 5.1
- [15] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011. 4.5, D.1, D.4
- [16] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5.3, D.1
- [17]
- [18] K. Song, A. Moeini, P. Wang, L. Gong, R. Chandra, S. Zhang, and Y. Qi. Reward is enough: LLMs are in-context reinforcement learners. arXiv preprint arXiv:2506.06303, 2025. 5.3
- [19] P. Steinberger. OpenClaw: An open-source autonomous AI agent. https://github.com/psteinb/openclaw, 2025. Originally released as Clawdbot, November 2025. 1, 5.1
- [20] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. 5.2
- [21] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024. 1, 5.1
- [22] Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang. OpenClaw-RL: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026. 5.3, D.1
- [23] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. 5.3
- [24] E. Zelikman, Y. Wu, J. Mu, and N. Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022. 5.3
- [25] A. L. Zhang, T. Kraska, and O. Khattab. Recursive language models. arXiv preprint arXiv:2512.24601, 2025. 5.3
- [26] It deleted its existing get_next_pokemon_press tool. [Figure: panels (a) 2.5 Pro updates, (b) 3 Pro updates, (c) 2.5 Pro fixation on find_path; turn (thousands) vs. lines changed and updates per 500-turn bin; series: Skills, Sub-agents]
- [27] It wrote a new tool called fly_menu_navigator, setting its autopress_buttons flag to true.
- [28] It added a directive to its persistent memory: "I must use the fly_menu_navigator tool as intended and trust its output. The get_next_pokemon_press tool was deleted to make space for fly_menu_navigator and should not have been used. This also highlights a failure to immediately use a newly defined tool."
- [Figure 12] The four Yellow Legacy battle-agent checkpoints marked a1/b1/c1/d1 on the complexity plot in Figure 4, spanning the arc from a linear survival-gate chain (a1), through a hard-reset compact rebuild (b1), to the master-agent introduction (d1).
- [Figure 13] The remaining ten Yellow Legacy battle-agent checkpoints, grouped by natural aspect. Row 1: long-chain variants around the first complexity spike and the late "last-stand" rewrite.