
Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn · 2026 · cs.AI · arXiv 2603.28052

Canonical reference. 83% of citing Pith papers cite this work as background.

34 Pith papers citing it

Background 83% of classified citations


abstract

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.
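The outer loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the directory layout (`candidate_NNN/harness.py`, `score.json`, `trace.txt`), the function names, and the toy proposer and evaluator are all hypothetical. The sketch only shows the general shape of the idea, namely that every prior candidate's source, score, and execution trace are kept as plain files that the proposer can read in full before writing the next candidate.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def record_candidate(root: Path, idx: int, source: str, score: float, trace: str) -> None:
    """Store one evaluated candidate as plain files (hypothetical layout)."""
    cand = root / f"candidate_{idx:03d}"
    cand.mkdir(parents=True)
    (cand / "harness.py").write_text(source)
    (cand / "score.json").write_text(json.dumps({"score": score}))
    (cand / "trace.txt").write_text(trace)


def load_history(root: Path) -> list[dict]:
    """Read back source, score, and trace for every prior candidate."""
    history = []
    for cand in sorted(root.glob("candidate_*")):
        history.append({
            "source": (cand / "harness.py").read_text(),
            "score": json.loads((cand / "score.json").read_text())["score"],
            "trace": (cand / "trace.txt").read_text(),
        })
    return history


def search(
    root: Path,
    propose: Callable[[list[dict]], str],
    evaluate: Callable[[str], tuple[float, str]],
    seed_source: str,
    steps: int,
) -> dict:
    """Outer loop: evaluate a seed, then repeatedly propose, evaluate, record."""
    score, trace = evaluate(seed_source)
    record_candidate(root, 0, seed_source, score, trace)
    for idx in range(1, steps + 1):
        # The proposer sees the full, uncompressed history of all candidates.
        source = propose(load_history(root))
        score, trace = evaluate(source)
        record_candidate(root, idx, source, score, trace)
    return max(load_history(root), key=lambda c: c["score"])


# Toy stand-ins (assumptions): a "harness" is the one-line text "K = <int>",
# and the score peaks when K equals 5. A real proposer would be an LLM agent.
def toy_evaluate(source: str) -> tuple[float, str]:
    k = int(source.split("=")[1])
    return float(-abs(k - 5)), f"ran with K={k}"


def toy_propose(history: list[dict]) -> str:
    best = max(history, key=lambda c: c["score"])
    k = int(best["source"].split("=")[1])
    return f"K = {k + 1}"


def run_demo() -> tuple[dict, int]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        best = search(root, toy_propose, toy_evaluate, seed_source="K = 2", steps=4)
        return best, len(load_history(root))
```

Calling `run_demo()` runs the seed plus four proposals and returns the best recorded candidate along with the number of candidates stored. Keeping the history as files rather than a summarized string is the design choice the abstract emphasizes: the proposer decides what to inspect instead of receiving compressed feedback.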


citation-role summary

background: 12

citation-polarity summary

background: 10 · support: 2

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

cs.LG · 2026-05-11 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

Agentic MIP Research: Accelerated Constraint Handler Generation

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

cs.CL · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Exploration and Exploitation Errors Are Measurable for Language Model Agents

cs.AI · 2026-04-14 · unverdicted · novelty 7.0

A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces BeliefTrack benchmark diagnosing three CBM failures in LLMs and shows RL with belief-state rewards cuts failure rates by 70.9% while representation steering cuts them by 46.1%.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

Harnesses for Inference-Time Alignment over Execution Trajectories

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.

Workspace Optimization: How to Train Your Agent

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.

HARBOR: Automated Harness Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.

SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

cs.CR · 2026-04-15 · unverdicted · novelty 6.0 · 2 refs

SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success rate by 42% via four coordinated layers while keeping task utility intact.

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

cs.AI · 2026-05-19 · conditional · novelty 5.0

MOCHA combines Chebyshev scalarization with exponential annealing to optimize LLM agent skills across performance and platform constraints, improving mean correctness by 7.5% over baselines on six tasks while finding more Pareto-optimal variants.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

Nautilus: From One Prompt to Plug-and-Play Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

citing papers explorer

Showing 34 of 34 citing papers.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values · cs.AI · 2026-05-11 · unverdicted · none · ref 42 · internal anchor
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents · cs.LG · 2026-05-11 · conditional · none · ref 10 · internal anchor
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling · cs.CL · 2026-05-08 · conditional · none · ref 20 · 2 links · internal anchor
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition · cs.CL · 2026-05-12 · unverdicted · none · ref 71 · internal anchor
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory · cs.AI · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
Agentic MIP Research: Accelerated Constraint Handler Generation · cs.AI · 2026-05-09 · unverdicted · none · ref 9 · internal anchor
LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch · cs.AI · 2026-05-05 · unverdicted · none · ref 60 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses · cs.CL · 2026-04-28 · unverdicted · none · ref 17 · 2 links · internal anchor
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery · cs.CR · 2026-04-22 · unverdicted · none · ref 25 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
Exploration and Exploitation Errors Are Measurable for Language Model Agents · cs.AI · 2026-04-14 · unverdicted · none · ref 5 · internal anchor
A policy-agnostic metric and controllable 2D grid environments with task DAGs enable measurement of exploration and exploitation errors in language model agents from observed actions.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks · cs.CL · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement · cs.CL · 2026-06-10 · unverdicted · none · ref 130 · internal anchor
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models · cs.AI · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
Introduces BeliefTrack benchmark diagnosing three CBM failures in LLMs and shows RL with belief-state rewards cuts failure rates by 70.9% while representation steering cuts them by 46.1%.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking · cs.AI · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale · cs.LG · 2026-05-20 · unverdicted · none · ref 23 · internal anchor
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
Harnesses for Inference-Time Alignment over Execution Trajectories · cs.LG · 2026-05-15 · unverdicted · none · ref 46 · internal anchor
Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.
Workspace Optimization: How to Train Your Agent · cs.AI · 2026-05-10 · unverdicted · none · ref 2 · internal anchor
Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration · cs.LG · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
HARBOR: Automated Harness Optimization · cs.LG · 2026-04-22 · unverdicted · none · ref 26 · internal anchor
HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs · cs.AI · 2026-04-20 · unverdicted · none · ref 26 · internal anchor
BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.
SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment · cs.CR · 2026-04-15 · unverdicted · none · ref 2 · 2 links · internal anchor
SafeHarness is a lifecycle-integrated security architecture for LLM agents that cuts unsafe behavior rate by 38% and attack success rate by 42% via four coordinated layers while keeping task utility intact.
MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization · cs.AI · 2026-05-19 · conditional · none · ref 15 · internal anchor
MOCHA combines Chebyshev scalarization with exponential annealing to optimize LLM agent skills across performance and platform constraints, improving mean correctness by 7.5% over baselines on six tasks while finding more Pareto-optimal variants.
Code as Agent Harness · cs.CL · 2026-05-18 · accept · none · ref 13 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Nautilus: From One Prompt to Plug-and-Play Robot Learning · cs.RO · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization · cs.HC · 2026-06-25 · unverdicted · none · ref 38 · internal anchor
HiLSVA introduces a plan-first multi-agent LLM system for scientific visualization that incorporates explicit human oversight, stepwise provenance, and learn-at-test-time adaptation, evaluated via case studies and a 12-participant user study.
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents · cs.AI · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration · cs.SE · 2026-05-04 · unverdicted · none · ref 4 · internal anchor
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems · cs.AI · 2026-05-21 · unreviewed · ref 8 · internal anchor
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents · cs.AI · 2026-05-21 · unreviewed · ref 46 · internal anchor
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents · cs.AI · 2026-05-20 · unreviewed · ref 9 · 2 links · internal anchor
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems · cs.AI · 2026-05-20 · unreviewed · ref 29 · internal anchor
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting · cs.LG · 2026-05-19 · unreviewed · ref 8 · internal anchor
Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces · cs.AI · 2026-05-11 · unreviewed · ref 20 · internal anchor
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents · cs.AI · 2026-04-20 · unreviewed · ref 23 · internal anchor
