Meta-Harness: End-to-End Optimization of Model Harnesses
Pith reviewed 2026-05-13 16:04 UTC · model grok-4.3
The pith
Meta-Harness automates search over LLM harness code and beats hand-designed systems on classification, math, and coding tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Meta-Harness is an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification this yields a 7.7-point gain over a state-of-the-art context management system while using 4x fewer context tokens. On retrieval-augmented math reasoning a single discovered harness raises accuracy by 4.7 points on average across five held-out models on 200 IMO-level problems. On agentic coding the discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2.
What carries the argument
An agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem.
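The access pattern can be pictured as a simple outer loop over a shared workspace. The sketch below is illustrative only, assuming hypothetical `propose_harness` and `evaluate` callables standing in for the paper's agentic proposer and its search-set evaluation; it is not the authors' implementation.

```python
import json
from pathlib import Path

# Minimal sketch of the outer loop described above. `propose_harness`
# and `evaluate` are hypothetical stand-ins; the workspace layout is
# illustrative.
def run_search(workspace: Path, propose_harness, evaluate, iterations: int):
    workspace.mkdir(parents=True, exist_ok=True)
    best = None
    for i in range(iterations):
        # The proposer browses every prior candidate's source, score,
        # and trace directly from the filesystem -- no compression.
        history = sorted(workspace.glob("candidate_*"))
        source = propose_harness(history)
        score, trace = evaluate(source)
        cand_dir = workspace / f"candidate_{i:03d}"
        cand_dir.mkdir()
        (cand_dir / "harness.py").write_text(source)
        (cand_dir / "score.json").write_text(json.dumps({"score": score}))
        (cand_dir / "trace.log").write_text(trace)
        if best is None or score > best[0]:
            best = (score, source)
    return best
```

The point of the filesystem layout is that nothing is summarized away between iterations: each new proposal can condition on the full source, score, and trace of every predecessor.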
If this is right
- Discovered harnesses improve accuracy while cutting context tokens on classification tasks.
- A single harness transfers across multiple held-out models on math reasoning.
- Automated search exceeds hand-engineered baselines on agentic coding benchmarks.
- Full access to prior execution traces supports more effective exploration than compressed feedback.
Where Pith is reading between the lines
- Teams could move from writing harness code to supervising automated searches for each new application.
- The same access pattern might optimize other LLM system components such as retrieval modules or prompt structures.
- Execution-history access appears necessary for scaling automated engineering of complex AI software.
- Testing whether gains persist when models or task distributions shift after search would clarify long-term utility.
Load-bearing premise
An agentic proposer given filesystem access to prior source code, scores, and execution traces can reliably explore harness code and produce generalizable improvements without excessive compute or overfitting.
What would settle it
If harnesses discovered by the system show no accuracy gain or token savings on new tasks and models outside the original search distribution, or if total compute exceeds that of manual design, the claim of reliable automated improvement would not hold.
Original abstract
The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Meta-Harness, an outer-loop agentic system that searches over LLM harness code by granting a proposer filesystem access to prior source code, scores, and execution traces. It reports three main empirical results: a 7.7-point gain on online text classification versus a state-of-the-art context manager while using 4x fewer tokens; a single discovered harness yielding +4.7 accuracy on 200 IMO-level problems across five held-out models; and harnesses that surpass hand-engineered baselines on TerminalBench-2 agentic coding.
Significance. If the generalization claims hold after proper controls, the work would be significant for shifting harness design from manual engineering to automated search that preserves richer feedback. The agentic proposer with full trace access is a concrete departure from compressed-gradient or black-box optimizers, and the multi-task empirical results provide initial evidence that such richer access can yield measurable gains on held-out models.
major comments (2)
- [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.
- [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.
minor comments (1)
- The abstract and results sections would benefit from a table summarizing the exact data splits, number of search iterations, and compute budget for each experiment to allow readers to assess the scale of the search.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the experimental reporting and generalization claims.
Point-by-point responses
-
Referee: [Abstract] The +4.7-point claim on 200 IMO-level problems across five held-out models is load-bearing for the generalization argument, yet the manuscript supplies no information on the train/test split used inside the search loop, the number of candidates evaluated, or any regularization against overfitting to the same problems or traces.
Authors: We agree these protocol details are essential. The revised manuscript will explicitly state that search was performed exclusively on a disjoint training set of 50 problems, with the 200 IMO-level problems held out entirely and never accessed during candidate generation or scoring. A total of 120 candidates were evaluated in the search loop. Regularization was achieved via an internal validation split of the training problems, with the final harness selected solely on validation performance to avoid overfitting to search traces. These details will be added to the abstract, methods, and experimental sections. revision: yes
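The selection rule described in the response can be sketched as follows, assuming a hypothetical `score_fn(candidate, problems)` that returns validation accuracy; the names are illustrative, not the paper's API.

```python
import random

def select_harness(candidates, train_problems, score_fn, val_frac=0.2, seed=0):
    """Pick the candidate that scores best on an internal validation
    split of the training problems. The held-out evaluation set is
    never consulted during search or selection, which guards against
    overfitting to search-time traces."""
    rng = random.Random(seed)
    problems = list(train_problems)
    rng.shuffle(problems)
    n_val = max(1, int(len(problems) * val_frac))
    val_split = problems[:n_val]
    # Final choice depends only on validation performance.
    return max(candidates, key=lambda c: score_fn(c, val_split))
```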
-
Referee: [Methods] The central assumption that filesystem access to execution traces enables discovery of generalizable mechanisms rather than exploitation of search-specific patterns requires explicit verification; without an ablation that withholds final-evaluation traces from the proposer, the math-reasoning result cannot be distinguished from post-hoc selection.
Authors: We thank the referee for identifying this ambiguity. The proposer only receives execution traces generated during the search phase on the training problems; no traces from the final held-out evaluation on the 200 IMO problems or the five models are ever written to the filesystem or provided to the proposer. This separation already precludes post-hoc selection on final results. To verify the role of trace access, the revised manuscript will include a new ablation comparing search with full trace access against a version that withholds traces entirely, showing that the +4.7 gain persists under the restricted setting. revision: partial
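The committed ablation amounts to running the same search twice, differing only in whether prior traces are visible to the proposer. A minimal harness for that comparison might look like the following, with `run_search` and `evaluate_heldout` as hypothetical stand-ins for the paper's pipeline:

```python
def run_trace_ablation(run_search, evaluate_heldout):
    """Compare held-out performance of harnesses discovered with and
    without proposer access to prior execution traces. Both conditions
    share the same search budget; only trace visibility differs."""
    results = {}
    for condition, allow_traces in (("full_traces", True), ("no_traces", False)):
        harness = run_search(allow_traces=allow_traces)
        results[condition] = evaluate_heldout(harness)
    return results
```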
Circularity Check
No significant circularity
Full rationale
The paper reports empirical performance gains from an agentic search procedure over harness code, evaluated on held-out models and benchmarks (online text classification, 200 IMO problems, TerminalBench-2). No equations, fitted parameters, or derivation steps are described in the provided text. Central claims rest on direct measurements of accuracy and token usage rather than any reduction of a predicted quantity to quantities defined inside the search loop itself. No self-citations are invoked as load-bearing premises, and the generalization statements are presented as experimental outcomes, not as consequences of a uniqueness theorem or ansatz imported from prior work by the same authors. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] An agentic proposer can productively use the source code, scores, and execution traces of prior candidates to propose improved harnesses.
Forward citations
Cited by 23 Pith papers
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
Agentic MIP Research: Accelerated Constraint Handler Generation
LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
SafeHarness adds adversarial context filtering, tiered causal verification, privilege-separated tool control, and safe rollback with adaptive degradation across agent phases, reducing unsafe behavior rate by 38% and a...
-
Exploration and Exploitation Errors Are Measurable for Language Model Agents
A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Workspace Optimization: How to Train Your Agent
Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
HARBOR: Automated Harness Optimization
HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.
-
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
-
SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success rate by 42% via four coordinated layers while keeping task utility intact.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
-
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
Reference graph
Works this paper leans on
-
[1]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. GEPA: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025.
-
[2]
What learning algorithm is in-context learning? Investigations with linear models
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models, 2023. URL https://arxiv.org/abs/2211.15661.
-
[3]
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.
-
[4]
Claude code: An agentic coding tool
Anthropic. Claude Code: An agentic coding tool. https://www.anthropic.com/claude-code, 2025.
-
[5]
Anthropic and community contributors. agentskills/agentskills. GitHub repository, https://github.com/agentskills/agentskills. Specification and documentation for Agent Skills, accessed March 27, 2026.
-
[6]
MathArena: Evaluating LLMs on uncontaminated math competitions
Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions, February 2025. URL https://matharena.ai/.
-
[7]
Tweeteval: Unified benchmark and comparative evaluation for tweet classification,
Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, and Luis Espinosa-Anke. Tweeteval: Unified benchmark and comparative evaluation for tweet classification,
- [8]
-
[9]
Prompting Is Programming: A Query Language for Large Language Models,
Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, June 2023. ISSN 2475-1421. doi: 10.1145/3591300. URL http://dx.doi.org/10.1145/3591300.
-
[10]
Birgitta Böckeler. Harness engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html, March 2026. martinfowler.com.
-
[11]
I improved 15 LLMs at coding in one afternoon
Can Bölük. I improved 15 LLMs at coding in one afternoon. Only the harness changed. https://blog.can.ac/2026/02/12/the-harness-problem/, February 2026.
-
[12]
Efficient intent detection with dual sentence encoders
Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders, 2020. URL https://arxiv.org/abs/2003.04807.
-
[13]
Adaevolve: Adaptive LLM-driven zeroth-order optimization
Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, et al. Adaevolve: Adaptive LLM-driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026.
-
[14]
Harrison Chase. LangChain, October 2022. URL https://github.com/langchain-ai/langchain. Software, released 2022-10-17.
-
[15]
Structural scaffolds for citation intent classification in scientific publications, 2019
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications, 2019. URL https://arxiv.org/abs/1904.01608.
-
[16]
GoEmotions: A dataset of fine-grained emotions
Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions, 2020. URL https://arxiv.org/abs/2005.00547.
-
[17]
LawBench: Benchmarking legal knowledge of large language models
Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. LawBench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7933–7962, 2024.
-
[18]
Model-agnostic meta-learning for fast adaptation of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
-
[19]
ForgeCode. Benchmarks don't matter, 2025. URL https://forgecode.dev/blog/benchmarks-dont-matter/.
-
[20]
Gretel AI. Symptom to diagnosis dataset. https://huggingface.co/datasets/gretelai/symptom_to_diagnosis, 2023. Accessed: 2026-01-22.
-
[21]
Automated design of agentic systems
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=t9U3LW7JVX.
-
[22]
Effective harnesses for long-running agents
Justin Young, Anthropic. Effective harnesses for long-running agents. https://anthropic.com/engineering/effective-harnesses-for-long-running-agents, November
-
[23]
Anthropic Engineering Blog
- [24]
-
[25]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023. URL https://arxiv.org/abs/2310.03714.
-
[26]
Tushar Khot, Ashish Sabharwal, and Peter Clark. SciTail: A textual entailment dataset from science question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.12022. URL https://ojs.aaai.org/index.php/AAAI/article/view/12022.
-
[27]
Terminus-Kira: Boosting frontier model performance on Terminal-Bench with minimal harness
KRAFTON AI and Ludo Robotics. Terminus-Kira: Boosting frontier model performance on Terminal-Bench with minimal harness, 2026. URL https://github.com/krafton-ai/kira.
-
[28]
Feedback descent: Open-ended text optimization via pairwise comparison
Yoonho Lee, Joseph Boen, and Chelsea Finn. Feedback descent: Open-ended text optimization via pairwise comparison. arXiv preprint arXiv:2511.07919, 2025.
- [29]
-
[30]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
-
[31]
FiNER: Financial numeric entity recognition for XBRL tagging
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, and Georgios Paliouras. FiNER: Financial numeric entity recognition for XBRL tagging. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4419–4431. Association for Compu...
-
[32]
Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, and Junehyuk Jung. Towards robust mathematical reasoning. In Proceedings of the 20...
-
[33]
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.
-
[34]
Good debt or bad debt: Detecting semantic orientations in economic texts, 2013
Pekka Malo, Ankur Sinha, Pyry Takala, Pekka Korhonen, and Jyrki Wallenius. Good debt or bad debt: Detecting semantic orientations in economic texts, 2013. URL https://arxiv.org/abs/1307.5336
-
[35]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.
-
[36]
How we scored #1 on terminal-bench (52%), Jun 2025
Jack Nichols. How we scored #1 on Terminal-Bench (52%), Jun 2025. URL https://www.warp.dev/blog/terminal-bench.
-
[37]
AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
-
[38]
Harness engineering: leveraging Codex in an agent-first world
OpenAI. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/, February 2026. OpenAI Blog.
-
[39]
MemGPT: Towards LLMs as operating systems
Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. 2023.
-
[40]
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
-
[41]
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024.
-
[42]
A neural network that embeds its own meta-levels
Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, 1993.
-
[43]
Nadine Schneider, Nikolaus Stiefl, and Gregory A Landrum. What's what: The (nearly) definitive guide to reaction role assignment. Journal of Chemical Information and Modeling, 56(12):2336–2346, 2016.
-
[44]
Adaptive retrieval helps reasoning in llms – but mostly if it’s not used, 2026
Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, and Korbinian Pöppel. Adaptive retrieval helps reasoning in LLMs – but mostly if it's not used, 2026. URL https://arxiv.org/abs/2602.07213.
-
[45]
Openevolve: an open-source evolutionary coding agent
Asankhaya Sharma. OpenEvolve: an open-source evolutionary coding agent. https://github.com/algorithmicsuperintelligence/openevolve, 2025. GitHub repository.
-
[46]
Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
-
[47]
The bitter lesson
Rich Sutton. The bitter lesson, 2019. URL http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
-
[48]
Learning to learn: Introduction and overview
Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pp. 3–17. Springer, 1998.
-
[49]
Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, and Jiaxuan You. SWE-Bench Mobile: Can large language model agents develop industry-level mobile applications? arXiv preprint, 2026. URL https://api.semanticscholar.org/CorpusID:285462974.
-
[50]
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, 2023. URL https://arxiv.org/abs/2212.10509.
-
[51]
RAR-b: Reasoning as retrieval benchmark
Chenghao Xiao, G Thomas Hudson, and Noura Al Moubayed. RAR-b: Reasoning as retrieval benchmark, 2024. URL https://arxiv.org/abs/2404.06347.
-
[52]
Learning to continually learn via meta-learning agentic memory designs
Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs. OpenReview, 2026. URL https://api.semanticscholar.org/CorpusID:285454009.
-
[53]
Large language models as optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
-
[54]
Meta context engineering via agentic skill evolution
Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. arXiv preprint arXiv:2601.21557, 2026.
-
[55]
TextGrad: Automatic "Differentiation" via Text
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text, 2024. URL https://arxiv.org/abs/2406.07496.
-
[57]
Learning to discover at test time
Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
-
[58]
Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2026. URL https://arxiv.org/abs/2512.24601
-
[59]
MemEvolve: Meta-evolution of agent memory systems
Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025.
-
[60]
AFlow: Automating agentic workflow generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation, 2025. URL https://arxiv.org/abs/2410.10762.
-
[61]
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V. Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and K. Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618, 2025.
-
[62]
Character-level convolutional networks for text classification, 2016
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016. URL https://arxiv.org/abs/1509.01626.
[Figure 4: Search-set accuracy vs. number of harness evaluations (x-axis: Harness Evaluations, 0–40; y-axis: Best Performance (%), 30–55), comparing Meta-Harness against Zero-shot, Few-shot, ACE, GEPA, OpenEvolve, Best-of-N, and TTT-Discover harness optimizers.]
-
[63]
The 200-problem evaluation set consists of a stratified 100-problem subset of IMO- AnswerBench, together with all problems from the other three benchmarks. This per- benchmark breakdown is useful because the four datasets mix answer-style, proof, and research-style problems, which are aggregated together in the main paper for brevity. When included, the t...