Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Ananjan Nandi; Christopher D Manning; Derek Chong; Dilara Soylu; Jiuding Sun; Simon Yu; Weiyan Shi

arxiv: 2605.10913 · v2 · pith:YQHAX3G6new · submitted 2026-05-11 · 💻 cs.AI · cs.PL· cs.SE

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Simon Yu , Derek Chong , Ananjan Nandi , Dilara Soylu , Jiuding Sun , Christopher D Manning , Weiyan Shi This is my paper

Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3

classification 💻 cs.AI cs.PLcs.SE

keywords meta-agentsexecution traceruntime forkingcounterfactual optimizationTree-RLagent supervisionfunctional programming modelGit-like trace

0 comments

The pith

Shepherd formalizes meta-agent operations as functions on a Git-like execution trace that records every interaction for fast forking and replay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Shepherd as a functional programming model that treats meta-agent operations on target agents as functions whose core steps are mechanized in Lean. It records every agent-environment interaction as a typed event inside a Git-like execution trace, so any past state can be forked and replayed without restarting from scratch. Forking the agent process and filesystem runs five times faster than Docker while reusing over 95 percent of cached prompts on replay. Three concrete uses demonstrate the model: a live supervisor raises pair-coding success from 28.8 to 54.7 percent, branching counterfactual search beats baselines by up to 11 points while cutting wall-clock time by up to 58 percent, and selective forking of rollouts lifts Tree-RL performance on TerminalBench-2 from 34.2 to 39.4 percent. These outcomes position the trace and forking mechanism as practical infrastructure for writing and running meta-agents.

Core claim

Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The substrate forks the agent process and its filesystem five times faster than Docker and reuses more than 95 percent of prompt cache on replay. When applied to runtime intervention, counterfactual meta-optimization, and Tree-RL training, the trace produces measurable gains in pass rates, benchmark scores, and training efficiency across the reported tasks.

What carries the argument

The typed execution trace that stores every agent-environment interaction as an event and supports forking of both the agent process and its filesystem.

If this is right

A live supervisor using the trace can raise pair-coding pass rates from 28.8 percent to 54.7 percent on CooperBench.
Branching exploration inside the trace outperforms baselines on four benchmarks by as much as 11 points and reduces wall-clock time by as much as 58 percent.
Forking rollouts at selected turns inside the trace raises TerminalBench-2 performance from 34.2 percent to 39.4 percent.
Any past agent state captured in the trace can be replayed or branched without restarting the full environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trace structure could let developers version and debug ordinary single-agent systems the way git versions code.
High cache reuse on replay suggests the mechanism may scale to longer-horizon agent runs where repeated prompt computation would otherwise dominate cost.
If the Lean mechanization of core operations is extended, it could support machine-checked proofs that certain meta-agent interventions preserve safety properties.

Load-bearing premise

The reported gains in intervention success, optimization scores, and RL performance arise from the trace and forking features rather than from unmeasured differences in experimental setup or implementation.

What would settle it

Run the same three applications with the forking and trace recording disabled while keeping every other component fixed; if the pass-rate, benchmark, and training improvements disappear, the central claim is supported.

Figures

Figures reproduced from arXiv: 2605.10913 by Ananjan Nandi, Christopher D Manning, Derek Chong, Dilara Soylu, Jiuding Sun, Simon Yu, Weiyan Shi.

**Figure 1.** Figure 1: SHEPHERD meta-agents. Top: A supervisor meta-agent manages code repair agents. Bottom: Results from three meta-agents: (A) live supervision; (B) meta-optimization; (C) Tree GRPO Abstract As LLM-based agentic systems grow more complex, they increasingly rely on meta-agents: higher-order agents that act on other agents, much like managers supervise employees. Yet existing agentic runtimes expose execution on… view at source ↗

**Figure 2.** Figure 2: Live intervention experiments on CooperBench, with Claude Haiku 4.5 as worker. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: LiveCodeBench comparison. Left: held-out test pass-rate versus optimization wallclock. Right: dev-set trajectory for each method across optimization wallclock. CRO subtask-cache reuse is reported separately in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: CRO computation reuse on LiveCodeBench rises from ∼1% on the first cold proposer session to over 60%. Setup. We evaluate on subsets of HoVer [13], MATH [8], IFBench [27], LiveCodeBench [11], and TerminalBench 2.0 (TB-2; [24]), comparing CRO against the baseline workflow, GEPA (optimizing workflow code) [2], and MetaHarness [19]. The executor is GPT-5.4-mini and meta-optimizers use GPT-5.4 (in the Codex h… view at source ↗

**Figure 5.** Figure 5: Trajectory compression across two worker model families and two benchmarks. The same [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: HoVer: held-out test pass-rate vs. optimization wallclock (left) and per-iteration dev-set [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗

**Figure 7.** Figure 7: HoVer: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_7.png] view at source ↗

**Figure 8.** Figure 8: IFBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p040_8.png] view at source ↗

**Figure 9.** Figure 9: IFBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗

**Figure 10.** Figure 10: LiveCodeBench: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p041_10.png] view at source ↗

**Figure 11.** Figure 11: LiveCodeBench: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p041_11.png] view at source ↗

**Figure 12.** Figure 12: MATH (Level 5): test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_12.png] view at source ↗

**Figure 13.** Figure 13: MATH (Level 5): subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p042_13.png] view at source ↗

**Figure 14.** Figure 14: TerminalBench 2.0: test pass-rate vs. wallclock and dev-set trajectory. [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗

**Figure 15.** Figure 15: TerminalBench 2.0: subtask cache reuse per CRO proposer session. [PITH_FULL_IMAGE:figures/full_fig_p043_15.png] view at source ↗

**Figure 16.** Figure 16: GRPO group composition over training (rows: base model; columns: setting). Tree-GRPO [PITH_FULL_IMAGE:figures/full_fig_p047_16.png] view at source ↗

**Figure 17.** Figure 17: Held-out Endless Terminals evaluation, sampled every 10 training steps (raw, unsmoothed). [PITH_FULL_IMAGE:figures/full_fig_p047_17.png] view at source ↗

**Figure 18.** Figure 18: Train raw reward (mean over G=8 roots) for both base models, panels are Qwen3.5- 35B-A3B (left) and Nemotron-3-Super-120B-A12B (right). Tree-GRPO (K=4, teal) reaches higher reward than Flat GRPO (red) at every rollout step. Faint dots are observed steps from the flat-baseline run; smooth lines are denoised trajectories. Case 1: Early mistake (T=4, reward=0.00) Task: Install the requests package and verify… view at source ↗

**Figure 19.** Figure 19: Early-mistake case. The wrong package name on turn 1 dooms the rest of the trajectory. [PITH_FULL_IMAGE:figures/full_fig_p048_19.png] view at source ↗

**Figure 20.** Figure 20: Ambiguous case. At least three turns offer plausible branches (skip-the-version-check, [PITH_FULL_IMAGE:figures/full_fig_p049_20.png] view at source ↗

**Figure 21.** Figure 21: Long-trajectory case. A 9-turn rollout with a wrong-file edit at turn 4 cascades into 5 [PITH_FULL_IMAGE:figures/full_fig_p049_21.png] view at source ↗

read the original abstract

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does: coordinating agents, halting risky actions before execution, or repairing failed runs, requires manipulation of agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, requiring it to build it's own tooling to reconstruct and orchestrate execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, where any past state can be forked 5x faster than docker commit and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Shepherd gives meta-agents a functional runtime with Lean checks and fast Git-like forking, but the benchmark gains are presented without enough controls to pin them on those features.

read the letter

Shepherd is a runtime that turns meta-agent actions into functions, mechanizes the basics in Lean, and logs everything in a typed trace that works like Git for forking past states. The forking is 5 times faster than Docker with good cache reuse on replay. The paper shows this in three places. A supervisor uses it to lift pair-coding success from 28.8% to 54.7% on CooperBench. Counterfactual branching beats four benchmarks by as much as 11 points and cuts wall-clock time by up to 58%. Forking rollouts in Tree-RL raises TerminalBench-2 from 34.2% to 39.4%. They also open-source the code. The weak part is the lack of experimental detail. The abstract reports those improvements but says nothing about what the baseline agents were, whether they had similar runtime access, or any ablations that isolate the trace and forking. Without that, it's not clear the gains come from Shepherd rather than other changes in the setup. The Lean part sounds solid on paper but we can't see how much is actually formalized. This is for people who build or study meta-agents and want infrastructure for control and exploration. It deserves a serious referee because the system is new, the use cases are specific, and the claims can be checked once the full experiments and code are reviewed. I'd send it out with a note to add the missing controls and baseline descriptions.

Referee Report

1 major / 0 minor

Summary. The paper introduces Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. It records every agent-environment interaction as a typed event in a Git-like execution trace, enabling forking and replay of past states. The system forks agent processes and filesystems 5× faster than Docker with >95% prompt-cache reuse on replay. It demonstrates the model in three applications: runtime intervention raising pair-coding pass rates from 28.8% to 54.7% on CooperBench; counterfactual meta-optimization outperforming baselines by up to 11 points with up to 58% wall-clock reduction across four benchmarks; and Tree-RL training improving TerminalBench-2 from 34.2% to 39.4%. These results are presented as establishing Shepherd as efficient infrastructure for programming meta-agents, with the system open-sourced.

Significance. If the empirical claims hold under proper controls, Shepherd could provide a useful formalized runtime substrate for meta-agent development, leveraging execution traces for intervention, optimization, and training. The mechanization of core operations in Lean is a clear strength, supplying machine-checked proofs for the model. Open-sourcing the system is also a positive step that supports reproducibility and community follow-on work.

major comments (1)

[Abstract] Abstract: the abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of Shepherd's contributions, including the Lean mechanization and open-sourcing, and for the constructive feedback on the abstract. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the abstract reports specific performance improvements (pair-coding pass rate 28.8%→54.7% on CooperBench, up to 11-point gains with 58% wall-clock reduction, TerminalBench-2 34.2%→39.4%) but supplies no experimental details, baselines, statistical tests, error bars, or methodology. This prevents verification that the gains are attributable to the typed execution trace and forking features rather than unstated factors, which is load-bearing for the central claim that 'these results establish Shepherd as an efficient infrastructure for programming meta-agents.'

Authors: The abstract is intentionally concise to highlight key outcomes, following standard academic practice. The full manuscript provides the requested experimental details, including baselines, statistical tests, error bars, and methodology, in the dedicated evaluation sections for each application (runtime intervention, counterfactual meta-optimization, and Tree-RL training). These sections describe controlled experiments that isolate the contributions of the typed execution traces and forking mechanisms, supporting the attribution of the reported gains. The abstract's central claim is thus grounded in the body of the paper rather than standing alone. revision: no

Circularity Check

0 steps flagged

No circularity: empirical claims rest on reported results without derivations or self-referential reductions

full rationale

The provided abstract contains no equations, derivations, fitted parameters, or self-citations. It introduces Shepherd as a functional model with execution traces and forking, then reports three separate empirical applications (runtime intervention on CooperBench, counterfactual optimization on four benchmarks, Tree-RL on TerminalBench-2) with performance deltas. These results are presented as demonstrations rather than as outputs derived from the system's definition by construction. No load-bearing step reduces a prediction or uniqueness claim to an input fit or prior self-citation; the central infrastructure claim is supported by the listed experimental outcomes, which remain externally verifiable in principle.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to identify specific free parameters, axioms, or invented entities. The Lean mechanization likely relies on standard mathematical axioms for formal verification but none are explicitly listed.

pith-pipeline@v0.9.0 · 5479 in / 1149 out tokens · 80011 ms · 2026-05-12T03:21:05.580474+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, AlexanderDuality.lean, ArithmeticFromLogic.lean reality_from_one_distinction echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

core operations mechanized in Lean... small algebraic-effects calculus... proof envelopes... typed event in a Git-like execution trace... fork... replay

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.