MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Pith reviewed 2026-05-22 04:43 UTC · model grok-4.3
The pith
Source-level rewriting lets autonomous agents fix structural failures and raise performance without human updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MOSS performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
What carries the argument
Deterministic multi-stage pipeline for anchoring source code modifications to failure evidence, with replay verification and gated deployment.
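As a rough illustration of what "MOSS retains stage ordering and verdicts" could mean in practice, the sketch below runs a fixed sequence of stages and treats the coding agent as an injected callable. Everything named here (`Evolution`, `run_pipeline`, the four stage labels, the return strings) is a hypothetical stand-in, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evolution:
    """State carried through one evolution cycle (hypothetical structure)."""
    failure_batch: List[str]
    patch: str = ""
    log: List[str] = field(default_factory=list)


def run_pipeline(
    evo: Evolution,
    coding_agent: Callable[[List[str]], str],   # pluggable external editor
    verify: Callable[[str, List[str]], bool],   # replay of the failure batch
    user_consents: Callable[[str], bool],       # consent gate before promotion
) -> str:
    """Run the stages in a fixed order; the orchestrator owns every verdict."""
    evo.log.append("curate")
    if not evo.failure_batch:
        return "skipped"    # no failure evidence to anchor the evolution to

    evo.log.append("modify")
    evo.patch = coding_agent(evo.failure_batch)  # the only delegated step

    evo.log.append("verify")
    if not verify(evo.patch, evo.failure_batch):
        return "rejected"

    evo.log.append("promote")
    if not user_consents(evo.patch):
        return "declined"
    return "promoted"
```

The point of the design is that the coding agent can be swapped without changing which checks run or in what order, since it only ever returns a patch and never decides the outcome.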
Load-bearing premise
The external coding-agent CLI produces modifications that pass replay verification on the failure batch without introducing undetected regressions or breaking unrelated functionality.
What would settle it
Apply MOSS to OpenClaw or a similar system and measure whether the four-task mean grader score rises after one cycle, while also checking for regressions in functionality outside the failure batch.
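For scale, the reported change can be restated as an absolute and a relative gain. The snippet below uses only the two means quoted in the abstract; the per-task scores are not given here, so none are assumed, and `score_lift` is a hypothetical helper name.

```python
def score_lift(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) between two mean scores."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    absolute = after - before
    return absolute, absolute / before


# Means reported for the four OpenClaw tasks, before and after one cycle.
absolute, relative = score_lift(0.25, 0.61)
print(f"absolute gain: {absolute:.2f}")   # absolute gain: 0.36
print(f"relative gain: {relative:.0%}")   # relative gain: 144%
```

A mean over four tasks can hide a mix of large gains and flat or falling tasks, which is why a per-task breakdown would be needed to settle the claim.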
Original abstract
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MOSS, a system enabling self-evolution in autonomous agent systems through source-level code rewriting rather than limiting changes to text-mutable artifacts such as prompts or skill files. It describes a deterministic multi-stage pipeline that curates production failure evidence, delegates modifications to a pluggable external coding-agent CLI, verifies candidates by replaying the failure batch in ephemeral workers, and promotes successful edits via user-consent-gated container swaps with health-probe rollback. The central empirical claim is that MOSS raises the mean grader score on four OpenClaw tasks from 0.25 to 0.61 after a single evolution cycle without human intervention.
Significance. If the reported performance improvement is shown to be robust, the work would be significant for autonomous agent research by demonstrating a practical route to structural self-adaptation at the code level. The paper positions source rewriting as a strict superset of text-based methods: it is Turing-complete, and its effects are deterministic rather than dependent on base-model compliance. The design choice to retain stage ordering and verdicts while outsourcing edits to an external CLI is a pragmatic strength that supports pluggability.
major comments (1)
- Verification and promotion pipeline: The safety argument for promoting edits rests on replay verification against the automatically curated failure batch alone. Because source-level changes can alter routing, hook ordering, and state invariants that affect tasks outside the batch, a passing replay on the failure set does not logically entail absence of regressions on the broader task distribution. No independent regression suite, differential testing on held-out tasks, or invariant check is described, which directly bears on the claim of reliable human-intervention-free evolution.
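One way to picture the missing safeguard is a promotion gate that requires both a gain on the failure batch and no per-task drop on a held-out set. This is a sketch of the referee's suggestion, not something the manuscript describes; the `grade` callable, the image and task identifiers, and the `tolerance` default are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical grader: (image id, task id) -> score in [0, 1].
Grade = Callable[[str, str], float]


def promotion_gate(
    baseline: str,
    candidate: str,
    failure_batch: Sequence[str],
    held_out: Sequence[str],
    grade: Grade,
    tolerance: float = 0.02,   # assumed allowance for grader noise
) -> Tuple[bool, List[str]]:
    """Return (verdict, regressed held-out tasks) for a candidate image."""
    # Replay check: the candidate must improve the mean on the failure batch.
    batch_gain = (
        mean(grade(candidate, t) for t in failure_batch)
        - mean(grade(baseline, t) for t in failure_batch)
    )
    # Differential check: flag every held-out task whose score drops by more
    # than the tolerance. Checked per task so a mean cannot mask a regression.
    regressions = [
        t for t in held_out
        if grade(baseline, t) - grade(candidate, t) > tolerance
    ]
    return batch_gain > 0 and not regressions, regressions
```

With this shape, a candidate that fixes the batch but breaks an unrelated task is rejected and the offending tasks are named, which is the evidence the replay-only design cannot produce.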
minor comments (2)
- The abstract and pipeline description would benefit from explicit enumeration of the health probes used for rollback gating and the precise definition of the grader score metric on OpenClaw.
- Add a brief comparison table or baseline description showing how the 0.25 starting score was obtained and whether any controls for prompt-only evolution were run in the same setting.
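To make the first minor comment concrete, a health-probe-gated swap with named probes might look like the sketch below. The manuscript does not enumerate its probes, so the `probes` mapping, the `deploy` callable, and the image identifiers are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple


def swap_with_rollback(
    current: str,
    candidate: str,
    deploy: Callable[[str], None],            # performs the in-place swap
    probes: Dict[str, Callable[[], bool]],    # probe name -> health check
) -> Tuple[str, List[str]]:
    """Swap to the candidate image; roll back if any health probe fails.

    Returns the image left running and the names of the failed probes.
    """
    deploy(candidate)
    failed = [name for name, probe in probes.items() if not probe()]
    if failed:
        deploy(current)   # restore the previous image
        return current, failed
    return candidate, []
```

Naming each probe lets the rollback record say which check failed, which is the kind of explicit enumeration the comment asks the authors to provide.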
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and describe the revisions we intend to incorporate.
Referee: Verification and promotion pipeline: The safety argument for promoting edits rests on replay verification against the automatically curated failure batch alone. Because source-level changes can alter routing, hook ordering, and state invariants that affect tasks outside the batch, a passing replay on the failure set does not logically entail absence of regressions on the broader task distribution. No independent regression suite, differential testing on held-out tasks, or invariant check is described, which directly bears on the claim of reliable human-intervention-free evolution.
Authors: We agree that verification against the failure batch alone does not guarantee the absence of regressions on the broader task distribution, since source-level edits can affect routing, hook ordering, and state invariants. Our design focuses on production-derived failure evidence to target observed issues, with user-consent gating and health-probe rollback providing additional safeguards during promotion. To address this point, we will revise the manuscript to explicitly discuss the scope of the verification and include results from differential testing on held-out tasks.
Revision: yes
Circularity Check
Empirical pipeline description with no mathematical derivations or self-referential reductions
The manuscript presents MOSS as an empirical system for source-level self-rewriting in agentic substrates, anchored to failure batches and verified via replay in ephemeral workers. The reported lift from 0.25 to 0.61 on OpenClaw tasks is framed as the measured outcome of this pipeline rather than a derived prediction or first-principles result. No equations, fitted parameters, uniqueness theorems, or ansatzes appear that could reduce to their own inputs by construction. The central claim therefore remains an externally falsifiable empirical observation against the stated benchmarks and does not exhibit any of the enumerated circularity patterns.