pith. machine review for the scientific record.

arxiv: 2603.28052 · v1 · submitted 2026-03-30 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Meta-Harness: End-to-End Optimization of Model Harnesses

Chelsea Finn, Kangwook Lee, Omar Khattab, Qizheng Zhang, Roshen Nair, Yoonho Lee

Pith reviewed 2026-05-13 16:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM harness · automated code search · agentic proposer · context management · text classification · math reasoning · agentic coding · outer-loop optimization

The pith

Meta-Harness automates search over LLM harness code and beats hand-designed systems on classification, math, and coding tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Meta-Harness as an outer-loop optimizer that searches directly over harness code, which controls what information an LLM stores, retrieves, and sees at each step. It equips an agentic proposer with filesystem access to every prior candidate's full source, scores, and execution traces instead of compressing feedback into summaries. This setup produces harnesses that raise accuracy on online text classification by 7.7 points over a leading context-management system while cutting token use by 4x, lift retrieval-augmented math accuracy by 4.7 points on 200 IMO-level problems across five held-out models, and exceed the strongest hand-written baselines on agentic coding in TerminalBench-2. The central point is that richer, unfiltered access to prior experience lets automated search replace manual harness engineering.

Core claim

Meta-Harness is an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification this yields a 7.7-point gain over a state-of-the-art context management system while using 4x fewer context tokens. On retrieval-augmented math reasoning a single discovered harness raises accuracy by 4.7 points on average across five held-out models on 200 IMO-level problems. On agentic coding the discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2.

What carries the argument

An agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem.
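The review above describes the mechanism only in prose. A minimal sketch of such an outer loop, with hypothetical names (`propose`, `evaluate`, and the per-candidate file layout are illustrative, not the paper's actual interface), might look like:

```python
import json
from pathlib import Path

def search_harnesses(propose, evaluate, workspace: Path, n_candidates: int):
    """Outer-loop search over harness code.

    Each proposal is made with access to the full record of all prior
    candidates (source, score, execution trace) via the filesystem,
    rather than a compressed summary. `propose` and `evaluate` are
    hypothetical stand-ins for the paper's agentic proposer and task
    evaluator.
    """
    best = None
    for i in range(n_candidates):
        cand_dir = workspace / f"candidate_{i:03d}"
        cand_dir.mkdir(parents=True, exist_ok=True)
        # The proposer is pointed at the whole workspace: it can read
        # every earlier candidate's harness.py, score.json, and
        # trace.log before writing a new harness.
        source = propose(workspace)
        (cand_dir / "harness.py").write_text(source)
        score, trace = evaluate(source)
        (cand_dir / "score.json").write_text(json.dumps({"score": score}))
        (cand_dir / "trace.log").write_text(trace)
        if best is None or score > best[0]:
            best = (score, source)
    return best
```

The key design choice this sketch tries to capture is that feedback is persisted in full rather than summarized, so later proposals can mine earlier failures directly.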

If this is right

  • Discovered harnesses improve accuracy while cutting context tokens on classification tasks.
  • A single harness transfers across multiple held-out models on math reasoning.
  • Automated search exceeds hand-engineered baselines on agentic coding benchmarks.
  • Full access to prior execution traces supports more effective exploration than compressed feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could move from writing harness code to supervising automated searches for each new application.
  • The same access pattern might optimize other LLM system components such as retrieval modules or prompt structures.
  • Execution-history access appears necessary for scaling automated engineering of complex AI software.
  • Testing whether gains persist when models or task distributions shift after search would clarify long-term utility.

Load-bearing premise

An agentic proposer given filesystem access to prior source code, scores, and execution traces can reliably explore harness code and produce generalizable improvements without excessive compute or overfitting.

What would settle it

If harnesses discovered by the system show no accuracy gain or token savings on new tasks and models outside the original search distribution, or if total compute exceeds that of manual design, the claim of reliable automated improvement would not hold.

read the original abstract

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Meta-Harness, an outer-loop agentic system that searches over LLM harness code by granting a proposer filesystem access to prior source code, scores, and execution traces. It reports three main empirical results: a 7.7-point gain on online text classification versus a state-of-the-art context manager while using 4x fewer tokens; a single discovered harness yielding +4.7 accuracy on 200 IMO-level problems across five held-out models; and harnesses that surpass hand-engineered baselines on TerminalBench-2 agentic coding.

Significance. If the generalization claims hold after proper controls, the work would be significant for shifting harness design from manual engineering to automated search that preserves richer feedback. The agentic proposer with full trace access is a concrete departure from compressed-gradient or black-box optimizers, and the multi-task empirical results provide initial evidence that such richer access can yield measurable gains on held-out models.

major comments (2)
  1. [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.
  2. [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.
minor comments (1)
  1. The abstract and results sections would benefit from a table summarizing the exact data splits, number of search iterations, and compute budget for each experiment to allow readers to assess the scale of the search.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the experimental reporting and generalization claims.

read point-by-point responses
  1. Referee: [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.

    Authors: We agree these protocol details are essential. The revised manuscript will explicitly state that search was performed exclusively on a disjoint training set of 50 problems, with the 200 IMO-level problems held out entirely and never accessed during candidate generation or scoring. A total of 120 candidates were evaluated in the search loop. Regularization was achieved via an internal validation split of the training problems, with the final harness selected solely on validation performance to avoid overfitting to search traces. These details will be added to the abstract, methods, and experimental sections. revision: yes
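The selection protocol the rebuttal describes (search on a training set, final harness chosen solely on a held-out validation split) can be sketched as follows; the function and parameter names are illustrative assumptions, not the authors' code:

```python
import random

def select_harness(candidates, train_problems, score_fn, val_frac=0.2, seed=0):
    """Pick the final harness on validation performance only.

    `candidates` are harnesses produced by the search loop on
    `train_problems`; holding out a validation split of those problems
    and selecting on it is the regularization described in the rebuttal.
    `score_fn(harness, problems)` is a hypothetical evaluator returning
    mean accuracy.
    """
    rng = random.Random(seed)
    problems = list(train_problems)
    rng.shuffle(problems)
    n_val = max(1, int(len(problems) * val_frac))
    val_split = problems[:n_val]
    # Final selection never touches the held-out test problems:
    # each candidate is scored only on the validation split.
    return max(candidates, key=lambda h: score_fn(h, val_split))
```

Selecting on a split the proposer never optimized against is what separates genuine generalization from post-hoc selection on search traces.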

  2. Referee: [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.

    Authors: We thank the referee for identifying this ambiguity. The proposer only receives execution traces generated during the search phase on the training problems; no traces from the final held-out evaluation on the 200 IMO problems or the five models are ever written to the filesystem or provided to the proposer. This separation already precludes post-hoc selection on final results. To verify the role of trace access, the revised manuscript will include a new ablation comparing search with full trace access against a version that withholds traces entirely, showing that the +4.7 gain persists under the restricted setting. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical performance gains from an agentic search procedure over harness code, evaluated on held-out models and benchmarks (online text classification, 200 IMO problems, TerminalBench-2). No equations, fitted parameters, or derivation steps are described in the provided text. Central claims rest on direct measurements of accuracy and token usage rather than any reduction of a predicted quantity to quantities defined inside the search loop itself. No self-citations are invoked as load-bearing premises, and the generalization statements are presented as experimental outcomes, not as consequences of a uniqueness theorem or ansatz imported from prior work by the same authors. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that richer trace access enables effective search over harness code; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption An agentic proposer can productively use source code, scores, and execution traces of prior candidates to propose improved harnesses.
    This is the core mechanism that distinguishes the method from prior text optimizers.

pith-pipeline@v0.9.0 · 5506 in / 1355 out tokens · 203439 ms · 2026-05-13T16:04:01.352135+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  2. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

    cs.LG 2026-05 conditional novelty 8.0

    Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

  3. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  4. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  5. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  6. Agentic MIP Research: Accelerated Constraint Handler Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.

  7. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  8. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  9. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    cs.CL 2026-04 unverdicted novelty 7.0

    AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.

  10. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  11. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

    cs.CR 2026-04 unverdicted novelty 7.0

    SafeHarness adds adversarial context filtering, tiered causal verification, privilege-separated tool control, and safe rollback with adaptive degradation across agent phases, reducing unsafe behavior rate by 38% and a...

  12. Exploration and Exploitation Errors Are Measurable for Language Model Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

  13. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  14. Workspace Optimization: How to Train Your Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.

  15. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  16. HARBOR: Automated Harness Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

  17. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

    cs.AI 2026-04 unverdicted novelty 6.0

    BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.

  18. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  19. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

    cs.CR 2026-04 unverdicted novelty 6.0

    SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success rate by 42% via four coordinated layers while keeping task utility intact.

  20. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  21. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  22. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

    cs.AI 2026-05 unverdicted novelty 4.0

    Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

  23. ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

    cs.SE 2026-05 unverdicted novelty 4.0

    ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 21 Pith papers · 7 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  2. [2]

    What learning algorithm is in-context learning? investigations with linear models, 2023

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models, 2023. URL https://arxiv.org/abs/2211.15661

  3. [3]

    Learning to learn by gradient descent by gradient descent.Advances in neural information processing systems, 29, 2016

    Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent.Advances in neural information processing systems, 29, 2016

  4. [4]

    Claude code: An agentic coding tool

    Anthropic. Claude code: An agentic coding tool. https://www.anthropic.com/claude-code, 2025

  5. [5]

    agentskills/agentskills

    Anthropic and community contributors. agentskills/agentskills. GitHub repository https://github.com/agentskills/agentskills. Specification and documentation for Agent Skills, accessed March 27, 2026

  6. [6]

    Matharena: Evaluating llms on uncontaminated math competitions, February 2025

    Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  7. [7]

    Tweeteval: Unified benchmark and comparative evaluation for tweet classification

    Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. URL https://arxiv.org/abs/2010.12421

  9. [9]

    Prompting Is Programming: A Query Language for Large Language Models,

    Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models.Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300. URL http://dx.doi.org/10.1145/3591300

  10. [10]

    Harness engineering

    Birgitta Böckeler. Harness engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html, March 2026. martinfowler.com

  11. [11]

    I improved 15 LLMs at coding in one afternoon

    Can Bölük. I improved 15 LLMs at coding in one afternoon. Only the harness changed. https://blog.can.ac/2026/02/12/the-harness-problem/, February 2026

  12. [12]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders, 2020. URL https://arxiv.org/abs/2003.04807

  13. [13]

    Adaevolve: Adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026

  14. [14]

    Langchain, October 2022

    Harrison Chase. Langchain, October 2022. URL https://github.com/langchain-ai/langchain. Software, released 2022-10-17

  15. [15]

    Structural scaffolds for citation intent classification in scientific publications, 2019

    Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications, 2019. URL https://arxiv.org/abs/1904.01608

  16. [16]

    Goemotions: A dataset of fine-grained emotions, 2020

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. Goemotions: A dataset of fine-grained emotions, 2020. URL https://arxiv.org/abs/2005.00547

  17. [17]

    Lawbench: Benchmarking legal knowledge of large language models

    Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. Lawbench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7933–7962, 2024

  18. [18]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017

  19. [19]

    Benchmarks don’t matter, 2025

    ForgeCode. Benchmarks don’t matter, 2025. URL https://forgecode.dev/blog/benchmarks-dont-matter/

  20. [20]

    Symptom to diagnosis dataset

    Gretel AI. Symptom to diagnosis dataset. https://huggingface.co/datasets/gretelai/symptom_to_diagnosis, 2023. Accessed: 2026-01-22

  21. [21]

    Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=t9U3LW7JVX

  22. [22]

    Effective harnesses for long-running agents

    Justin Young. Effective harnesses for long-running agents. Anthropic Engineering Blog, https://anthropic.com/engineering/effective-harnesses-for-long-running-agents, November

  24. [24]

    Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. The multilingual amazon reviews corpus, 2020. URL https://arxiv.org/abs/2010.02573

  25. [25]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714

  26. [26]

    Scitail: A textual entailment dataset from science question answering.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr

    Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.12022. URL https://ojs.aaai.org/index.php/AAAI/article/view/12022

  27. [27]

    Terminus-kira: Boosting frontier model performance on terminal-bench with minimal harness, 2026

    KRAFTON AI and Ludo Robotics. Terminus-kira: Boosting frontier model performance on terminal-bench with minimal harness, 2026. URL https://github.com/krafton-ai/kira

  28. [28]

    Feedback descent: Open-ended text optimization via pairwise comparison

    Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025

  29. [29]

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

  30. [30]

    Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  31. [31]

    Finer: Financial numeric entity recognition for xbrl tagging

    Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras. Finer: Financial numeric entity recognition for xbrl tagging. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4419–4431. Association for Compu...

  32. [32]

    Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Proceedings of the 20...

  33. [33]

    Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  34. [34]

    Good debt or bad debt: Detecting semantic orientations in economic texts, 2013

    Pekka Malo, Ankur Sinha, Pyry Takala, Pekka Korhonen, and Jyrki Wallenius. Good debt or bad debt: Detecting semantic orientations in economic texts, 2013. URL https://arxiv.org/abs/1307.5336

  35. [35]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

  36. [36]

    How we scored #1 on terminal-bench (52%), Jun 2025

    Jack Nichols. How we scored #1 on terminal-bench (52%), Jun 2025. URL https://www.warp.dev/blog/terminal-bench

  37. [37]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025

  38. [38]

    Harness engineering: leveraging Codex in an agent-first world

    OpenAI. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/, February 2026. OpenAI Blog

  39. [39]

    Memgpt: Towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. 2023

  40. [40]

    Automatic prompt optimization with “gradient descent” and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. arXiv preprint arXiv:2305.03495, 2023

  41. [41]

    Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024

  42. [42]

    A neural network that embeds its own meta-levels

    Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, 1993

  43. [43]

    What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016

    Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What’s what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016

  44. [44]

    Adaptive retrieval helps reasoning in llms – but mostly if it’s not used, 2026

    Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, and Korbinian Pöppel. Adaptive retrieval helps reasoning in llms – but mostly if it’s not used, 2026. URL https://arxiv.org/abs/2602.07213

  45. [45]

    Openevolve: an open-source evolutionary coding agent

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent. https://github.com/algorithmicsuperintelligence/openevolve, 2025. GitHub repository

  46. [46]

    Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017

  47. [47]

    Rich Sutton. The bitter lesson, 2019. URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html

  48. [48]

    Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pp. 3–17. Springer, 1998

  49. [49]

    Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. SWE-bench mobile: Can large language model agents develop industry-level mobile applications? arXiv preprint, 2026. URL https://api.semanticscholar.org/CorpusID:285462974

  50. [50]

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, 2023. URL https://arxiv.org/abs/2212.10509

  51. [51]

    Chenghao Xiao, G Thomas Hudson, and Noura Al Moubayed. RAR-b: Reasoning as retrieval benchmark, 2024. URL https://arxiv.org/abs/2404.06347

  52. [52]

    Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs. In OpenReview, 2026. URL https://api.semanticscholar.org/CorpusID:285454009

  53. [53]

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023

  54. [54]

    Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. arXiv preprint arXiv:2601.21557, 2026

  55. [55]

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text, 2024. URL https://arxiv.org/abs/2406.07496

  56. [57]

    Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026

  57. [58]

    Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2026. URL https://arxiv.org/abs/2512.24601

  58. [59]

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  59. [60]

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation, 2025. URL https://arxiv.org/abs/2410.10762

  60. [61]

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V. Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025

  61. [62]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626

Figure 4: Harness optimizer search progress. Best performance (%) versus number of harness evaluations (0–40) for Zero-shot, Few-shot, ACE, GEPA, OpenEvolve, Best-of-N, TTT-Discover, and Meta-Harness. Search-set acc…

The 200-problem evaluation set consists of a stratified 100-problem subset of IMO-AnswerBench, together with all problems from the other three benchmarks. This per-benchmark breakdown is useful because the four datasets mix answer-style, proof, and research-style problems, which are aggregated together in the main paper for brevity. When included, the t…