pith. machine review for the scientific record.

arxiv: 2603.28052 · v1 · submitted 2026-03-30 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Meta-Harness: End-to-End Optimization of Model Harnesses

Chelsea Finn, Kangwook Lee, Omar Khattab, Qizheng Zhang, Roshen Nair, Yoonho Lee

Pith reviewed 2026-05-13 16:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM harness · automated code search · agentic proposer · context management · text classification · math reasoning · agentic coding · outer-loop optimization

The pith

Meta-Harness automates search over LLM harness code and beats hand-designed systems on classification, math, and coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Meta-Harness as an outer-loop optimizer that searches directly over harness code, which controls what information an LLM stores, retrieves, and sees at each step. It equips an agentic proposer with filesystem access to every prior candidate's full source, scores, and execution traces instead of compressing feedback into summaries. This setup produces harnesses that raise accuracy on online text classification by 7.7 points over a leading context-management system while cutting token use by 4x, lift retrieval-augmented math accuracy by 4.7 points on 200 IMO-level problems across five held-out models, and exceed the strongest hand-written baselines on agentic coding in TerminalBench-2. The central point is that richer, unfiltered access to prior experience lets automated search replace manual harness engineering.

Core claim

Meta-Harness is an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification this yields a 7.7-point gain over a state-of-the-art context management system while using 4x fewer context tokens. On retrieval-augmented math reasoning a single discovered harness raises accuracy by 4.7 points on average across five held-out models on 200 IMO-level problems. On agentic coding the discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2.

What carries the argument

An agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem.
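The review above describes the mechanism only in prose. A minimal sketch of such an outer loop, with hypothetical names (`propose`, `evaluate`, and the per-candidate file layout are illustrative, not the paper's actual interface), might look like:

```python
import json
from pathlib import Path

def search_harnesses(propose, evaluate, workspace: Path, n_candidates: int):
    """Outer-loop search over harness code.

    Each proposal is made with access to the full record of all prior
    candidates (source, score, execution trace) via the filesystem,
    rather than a compressed summary. `propose` and `evaluate` are
    hypothetical stand-ins for the paper's agentic proposer and task
    evaluator.
    """
    best = None
    for i in range(n_candidates):
        cand_dir = workspace / f"candidate_{i:03d}"
        cand_dir.mkdir(parents=True, exist_ok=True)
        # The proposer is pointed at the whole workspace: it can read
        # every earlier candidate's harness.py, score.json, and
        # trace.log before writing a new harness.
        source = propose(workspace)
        (cand_dir / "harness.py").write_text(source)
        score, trace = evaluate(source)
        (cand_dir / "score.json").write_text(json.dumps({"score": score}))
        (cand_dir / "trace.log").write_text(trace)
        if best is None or score > best[0]:
            best = (score, source)
    return best
```

The key design choice this sketch tries to capture is that feedback is persisted in full rather than summarized, so later proposals can mine earlier failures directly.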

If this is right

  • Discovered harnesses improve accuracy while cutting context tokens on classification tasks.
  • A single harness transfers across multiple held-out models on math reasoning.
  • Automated search exceeds hand-engineered baselines on agentic coding benchmarks.
  • Full access to prior execution traces supports more effective exploration than compressed feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could move from writing harness code to supervising automated searches for each new application.
  • The same access pattern might optimize other LLM system components such as retrieval modules or prompt structures.
  • Execution-history access appears necessary for scaling automated engineering of complex AI software.
  • Testing whether gains persist when models or task distributions shift after search would clarify long-term utility.

Load-bearing premise

An agentic proposer given filesystem access to prior source code, scores, and execution traces can reliably explore harness code and produce generalizable improvements without excessive compute or overfitting.

What would settle it

If harnesses discovered by the system show no accuracy gain or token savings on new tasks and models outside the original search distribution, or if total compute exceeds that of manual design, the claim of reliable automated improvement would not hold.

read the original abstract

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Meta-Harness, an outer-loop agentic system that searches over LLM harness code by granting a proposer filesystem access to prior source code, scores, and execution traces. It reports three main empirical results: a 7.7-point gain on online text classification versus a state-of-the-art context manager while using 4x fewer tokens; a single discovered harness yielding +4.7 accuracy on 200 IMO-level problems across five held-out models; and harnesses that surpass hand-engineered baselines on TerminalBench-2 agentic coding.

Significance. If the generalization claims hold after proper controls, the work would be significant for shifting harness design from manual engineering to automated search that preserves richer feedback. The agentic proposer with full trace access is a concrete departure from compressed-gradient or black-box optimizers, and the multi-task empirical results provide initial evidence that such richer access can yield measurable gains on held-out models.

major comments (2)
  1. [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.
  2. [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.
minor comments (1)
  1. The abstract and results sections would benefit from a table summarizing the exact data splits, number of search iterations, and compute budget for each experiment to allow readers to assess the scale of the search.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the experimental reporting and generalization claims.

read point-by-point responses
  1. Referee: [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.

    Authors: We agree these protocol details are essential. The revised manuscript will explicitly state that search was performed exclusively on a disjoint training set of 50 problems, with the 200 IMO-level problems held out entirely and never accessed during candidate generation or scoring. A total of 120 candidates were evaluated in the search loop. Regularization was achieved via an internal validation split of the training problems, with the final harness selected solely on validation performance to avoid overfitting to search traces. These details will be added to the abstract, methods, and experimental sections. revision: yes
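The selection protocol the rebuttal describes (search on a training set, final harness chosen solely on a held-out validation split) can be sketched as follows; the function and parameter names are illustrative assumptions, not the authors' code:

```python
import random

def select_harness(candidates, train_problems, score_fn, val_frac=0.2, seed=0):
    """Pick the final harness on validation performance only.

    `candidates` are harnesses produced by the search loop on
    `train_problems`; holding out a validation split of those problems
    and selecting on it is the regularization described in the rebuttal.
    `score_fn(harness, problems)` is a hypothetical evaluator returning
    mean accuracy.
    """
    rng = random.Random(seed)
    problems = list(train_problems)
    rng.shuffle(problems)
    n_val = max(1, int(len(problems) * val_frac))
    val_split = problems[:n_val]
    # Final selection never touches the held-out test problems:
    # each candidate is scored only on the validation split.
    return max(candidates, key=lambda h: score_fn(h, val_split))
```

Selecting on a split the proposer never optimized against is what separates genuine generalization from post-hoc selection on search traces.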

  2. Referee: [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.

    Authors: We thank the referee for identifying this ambiguity. The proposer only receives execution traces generated during the search phase on the training problems; no traces from the final held-out evaluation on the 200 IMO problems or the five models are ever written to the filesystem or provided to the proposer. This separation already precludes post-hoc selection on final results. To verify the role of trace access, the revised manuscript will include a new ablation comparing search with full trace access against a version that withholds traces entirely, showing that the +4.7 gain persists under the restricted setting. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical performance gains from an agentic search procedure over harness code, evaluated on held-out models and benchmarks (online text classification, 200 IMO problems, TerminalBench-2). No equations, fitted parameters, or derivation steps are described in the provided text. Central claims rest on direct measurements of accuracy and token usage rather than any reduction of a predicted quantity to quantities defined inside the search loop itself. No self-citations are invoked as load-bearing premises, and the generalization statements are presented as experimental outcomes, not as consequences of a uniqueness theorem or ansatz imported from prior work by the same authors. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that richer trace access enables effective search over harness code; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption An agentic proposer can productively use source code, scores, and execution traces of prior candidates to propose improved harnesses.
    This is the core mechanism that distinguishes the method from prior text optimizers.

pith-pipeline@v0.9.0 · 5506 in / 1355 out tokens · 203439 ms · 2026-05-13T16:04:01.352135+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  2. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

    cs.LG 2026-05 conditional novelty 8.0

    Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

  3. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  4. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  5. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  6. Agentic MIP Research: Accelerated Constraint Handler Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.

  7. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  8. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  9. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    cs.CL 2026-04 unverdicted novelty 7.0

    AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.

  10. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  11. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

    cs.CR 2026-04 unverdicted novelty 7.0

    SafeHarness adds adversarial context filtering, tiered causal verification, privilege-separated tool control, and safe rollback with adaptive degradation across agent phases, reducing unsafe behavior rate by 38% and a...

  12. Exploration and Exploitation Errors Are Measurable for Language Model Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

  13. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  14. Workspace Optimization: How to Train Your Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.

  15. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  16. HARBOR: Automated Harness Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

  17. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

    cs.AI 2026-04 unverdicted novelty 6.0

    BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.

  18. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  19. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

    cs.CR 2026-04 unverdicted novelty 6.0

    SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success rate by 42% via four coordinated layers while keeping task utility intact.

  20. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  21. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  22. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  23. ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

    cs.SE 2026-05 unverdicted novelty 4.0

    ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 21 Pith papers · 7 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  2. [2]

    What learning algorithm is in-context learning? investigations with linear models, 2023

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models, 2023. URL https://arxiv.org/abs/2211.15661

  3. [3]

    Learning to learn by gradient descent by gradient descent.Advances in neural information processing systems, 29, 2016

    Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent.Advances in neural information processing systems, 29, 2016

  4. [4]

    Claude code: An agentic coding tool

    Anthropic. Claude code: An agentic coding tool. https://www.anthropic.com/claude-code, 2025

  5. [5]

    agentskills/agentskills

    Anthropic and community contributors. agentskills/agentskills. GitHub repository https://github.com/agentskills/agentskills. Specification and documentation for Agent Skills, accessed March 27, 2026

  6. [6]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  7. [7]

    Tweeteval: Unified benchmark and comparative evaluation for tweet classification

    Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. URL https://arxiv.org/abs/2010.12421

  9. [9]

    Prompting Is Programming: A Query Language for Large Language Models,

    Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models.Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300. URL http://dx.doi.org/10.1145/3591300

  10. [10]

    Harness engineering

    Birgitta Böckeler. Harness engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html, March 2026. martinfowler.com

  11. [11]

    I improved 15 LLMs at coding in one afternoon

    Can Bölük. I improved 15 LLMs at coding in one afternoon. Only the harness changed. https://blog.can.ac/2026/02/12/the-harness-problem/, February 2026

  12. [12]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders, 2020. URL https://arxiv.org/abs/2003.04807

  13. [13]

    Adaevolve: Adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026

  14. [14]

    Langchain, October 2022

    Harrison Chase. Langchain, October 2022. URL https://github.com/langchain-ai/langchain. Software, released 2022-10-17

  15. [15]

    Structural scaffolds for citation intent classification in scientific publications, 2019

    Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications, 2019. URL https://arxiv.org/abs/1904.01608

  16. [16]

    Goemotions: A dataset of fine-grained emotions, 2020

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. Goemotions: A dataset of fine-grained emotions, 2020. URL https://arxiv.org/abs/2005.00547

  17. [17]

    Lawbench: Benchmarking legal knowledge of large language models

    Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. Lawbench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7933–7962, 2024

  18. [18]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017

  19. [19]

    Benchmarks don’t matter, 2025

    ForgeCode. Benchmarks don’t matter, 2025. URL https://forgecode.dev/blog/benchmarks-dont-matter/

  20. [20]

    Symptom to diagnosis dataset

    Gretel AI. Symptom to diagnosis dataset. https://huggingface.co/datasets/gretelai/symptom_to_diagnosis, 2023. Accessed: 2026-01-22

  21. [21]

    Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=t9U3LW7JVX

  22. [22]

    Effective harnesses for long-running agents

    Justin Young. Effective harnesses for long-running agents. Anthropic Engineering Blog, https://anthropic.com/engineering/effective-harnesses-for-long-running-agents, November

  24. [24]

    Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. The multilingual amazon reviews corpus, 2020. URL https://arxiv.org/abs/2010.02573

  25. [25]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714

  26. [26]

    Scitail: A textual entailment dataset from science question answering.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr

    Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.12022. URL https://ojs.aaai.org/index.php/AAAI/article/view/12022

  27. [27]

    Terminus-kira: Boosting frontier model performance on terminal-bench with minimal harness, 2026

    KRAFTON AI and Ludo Robotics. Terminus-kira: Boosting frontier model performance on terminal-bench with minimal harness, 2026. URL https://github.com/krafton-ai/kira

  28. [28]

    Feedback descent: Open-ended text optimization via pairwise comparison

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025

  29. [29]

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

  30. [30]

    Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  31. [31]

    Finer: Financial numeric entity recognition for xbrl tagging

    Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras. Finer: Financial numeric entity recognition for xbrl tagging. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4419–4431. Association for Compu...

  32. [32]

    Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Proceedings of the 20...

  33. [33]

    Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  34. [34]

    Good debt or bad debt: Detecting semantic orientations in economic texts, 2013

    Pekka Malo, Ankur Sinha, Pyry Takala, Pekka Korhonen, and Jyrki Wallenius. Good debt or bad debt: Detecting semantic orientations in economic texts, 2013. URL https://arxiv.org/abs/1307.5336

  35. [35]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

  36. [36]

    How we scored #1 on terminal-bench (52%), Jun 2025

    Jack Nichols. How we scored #1 on terminal-bench (52%), Jun 2025. URL https://www.warp.dev/blog/terminal-bench

  37. [37]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  38. [38]

    Harness engineering: leveraging Codex in an agent-first world

    OpenAI. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/, February 2026. OpenAI Blog

  39. [39]

    Memgpt: Towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. 2023

  40. [40]

    Automatic prompt optimization with “gradient descent” and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. arXiv preprint arXiv:2305.03495, 2023

  41. [41]

    Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

  42. [42]

    A neural network that embeds its own meta-levels

    Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, 1993

  43. [43]

    What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016

    Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016

  44. [44]

    Adaptive retrieval helps reasoning in llms – but mostly if it’s not used, 2026

    Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, and Korbinian Pöppel. Adaptive retrieval helps reasoning in llms – but mostly if it’s not used, 2026. URL https://arxiv.org/abs/2602.07213

  45. [45]

    Openevolve: an open-source evolutionary coding agent

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent. https://github.com/algorithmicsuperintelligence/openevolve, 2025. GitHub repository

  46. [46]

    Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017

  47. [47]

    Rich Sutton. The bitter lesson, 2019. URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html

  48. [48]

    Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pp. 3–17. Springer, 1998

  49. [49]

    Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. SWE-bench mobile: Can large language model agents develop industry-level mobile applications? arXiv preprint, 2026. URL https://api.semanticscholar.org/CorpusID:285462974

  50. [50]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, 2023. URL https://arxiv.org/abs/2212.10509

  51. [51]

    Chenghao Xiao, G Thomas Hudson, and Noura Al Moubayed. RAR-b: Reasoning as retrieval benchmark, 2024. URL https://arxiv.org/abs/2404.06347

  52. [52]

    Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs. In OpenReview, 2026. URL https://api.semanticscholar.org/CorpusID:285454009

  53. [53]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  54. [54]

    Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. arXiv preprint arXiv:2601.21557, 2026

  55. [55]

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text, 2024. URL https://arxiv.org/abs/2406.07496

  56. [57]

    Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026

  57. [58]

    Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2026. URL https://arxiv.org/abs/2512.24601

  58. [59]

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  59. [60]

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation, 2025. URL https://arxiv.org/abs/2410.10762

  60. [61]

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V. Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025

  61. [62]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626

Figure 4: Harness optimizer search progress. Best performance (%) versus number of harness evaluations (0–40) for Zero-shot, Few-shot, ACE, GEPA, OpenEvolve, Best-of-N, TTT-Discover, and Meta-Harness. Search-set acc…

The 200-problem evaluation set consists of a stratified 100-problem subset of IMO-AnswerBench, together with all problems from the other three benchmarks. This per-benchmark breakdown is useful because the four datasets mix answer-style, proof, and research-style problems, which are aggregated together in the main paper for brevity. When included, the t…