pith. machine review for the scientific record.

arxiv: 2603.29093 · v2 · submitted 2026-03-31 · 💻 cs.CL · cs.AI · cs.IR

Recognition: 1 theorem link · Lean Theorem

APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Ankit Chadha, Masud Moshtaghi, Pratyay Banerjee

Pith reviewed 2026-05-14 00:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords LLM agents · experience replay · procedural memory · non-parametric learning · autonomous agents · structured retrieval · plan reuse · online learning

The pith

Structured procedural memory lets LLM agents reuse past solutions on new tasks without changing weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based autonomous agents typically lack persistent procedural memory and must re-derive solutions even for structurally similar tasks. APEX-EM addresses this by maintaining a non-parametric memory of full execution traces that include planning steps, artifacts, error analyses, and quality scores. It uses a Plan-Retrieve-Generate-Iterate-Ingest workflow with task verifiers to decide what to store as positive or negative examples. A hybrid retrieval system combining semantic search, structural signatures, and plan DAG traversal enables transfer to tasks with no lexical overlap. Experiments on multiple benchmarks show large accuracy improvements, such as 48 percentage points on KGQAGen-10k, demonstrating the value of external structured memory for online learning.
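The Plan-Retrieve-Generate-Iterate-Ingest workflow described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the class and function names (`ExperienceMemory`, `make_plan`, `prgii_step`) and the binary verifier are stand-ins invented for this sketch.

```python
# Minimal, illustrative sketch of a Plan-Retrieve-Generate-Iterate-Ingest
# (PRGII) loop over a dual-outcome memory. All names are hypothetical
# stand-ins for this sketch, not the paper's actual API.

class ExperienceMemory:
    """Dual-outcome store: successes and annotated failures."""
    def __init__(self):
        self.positive, self.negative = [], []

    def retrieve(self, task):
        # Stand-in for hybrid retrieval (semantic + structural + DAG traversal):
        # here we simply match on identical toy plans.
        return [t for t in self.positive if t["plan"] == make_plan(task)]

    def ingest(self, trace, outcome):
        (self.positive if outcome == "positive" else self.negative).append(trace)


def make_plan(task):
    # Toy "planner": the plan is just the sorted tuple of operations.
    return tuple(sorted(task["ops"]))


def prgii_step(task, memory, max_iters=3):
    plan = make_plan(task)                      # Plan
    examples = memory.retrieve(task)            # Retrieve
    trace = {"plan": plan, "iterations": []}
    artifact = None
    for _ in range(max_iters):
        # Generate: reuse a retrieved solution when one exists, else try fresh.
        artifact = examples[0]["solution"] if examples else task.get("attempt")
        ok = artifact == task["answer"]         # Task Verifier (binary here)
        trace["iterations"].append({"artifact": artifact, "ok": ok})
        if ok:
            trace["solution"] = artifact
            memory.ingest(trace, "positive")    # Ingest a success
            return artifact
    memory.ingest(trace, "negative")            # Ingest an annotated failure
    return artifact
```

The key property the sketch preserves: once one task is solved and ingested, a structurally identical later task is answered from memory rather than re-derived, with no change to any model weights.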

Core claim

The paper establishes that by encoding complete procedural-episodic traces in a dual-outcome experience memory and retrieving them through a combination of semantic, structural, and graph-based methods, agents can accumulate and reuse procedural knowledge across tasks, achieving substantial gains in accuracy and success rate on reasoning and code generation benchmarks while keeping the underlying model weights frozen.

What carries the argument

The dual-outcome Experience Memory that stores successful experiences as positive in-context examples and failures as negative examples with error annotations, retrieved via hybrid methods including plan DAG traversal.
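As a rough sketch of how such hybrid retrieval might combine its three signals, the scorer below blends a semantic proxy, a structural-signature match, and plan-DAG edge overlap. The weights, similarity measures, and field names are invented for illustration; the paper's actual scoring is not specified here.

```python
# Illustrative hybrid retrieval score combining three signals:
# semantic overlap, structural-signature match, and plan-DAG edge overlap.
# Weights and similarity measures are invented for this sketch.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_score(query, experience, w_sem=0.4, w_struct=0.3, w_dag=0.3):
    sem = jaccard(query["tokens"], experience["tokens"])          # semantic proxy
    struct = 1.0 if query["signature"] == experience["signature"] else 0.0
    dag = jaccard(query["plan_edges"], experience["plan_edges"])  # DAG overlap
    return w_sem * sem + w_struct * struct + w_dag * dag

def retrieve(query, memory, k=2):
    # Rank stored experiences by combined score, keep the top k.
    return sorted(memory, key=lambda e: hybrid_score(query, e), reverse=True)[:k]
```

With these toy weights, an experience sharing the query's operational structure but none of its words outranks one sharing only vocabulary, which is exactly the behavior the cross-domain transfer claim relies on.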

If this is right

  • Agents can improve performance on new tasks by directly reusing verified procedural plans from memory.
  • Cross-task transfer occurs even when tasks share no words in common but have analogous structures.
  • Online learning proceeds continuously without any modification to the model's parameters.
  • Rich multi-dimensional reward signals from task verifiers enhance the quality of stored experiences.
  • Component contributions vary by task type, with iteration helping compensate for weaker feedback in some domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Long-running agents could build libraries of reusable procedures that grow more valuable over extended sessions.
  • Similar memory structures might help in domains beyond LLMs, such as robotic planning systems.
  • Explicit separation of procedural structure from surface language could reduce reliance on ever-larger context windows.

Load-bearing premise

The hybrid retrieval can reliably detect and transfer useful procedural knowledge between tasks that have matching operational structures but share no words or surface features.
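One way to make this premise concrete: canonicalize each plan to the shape of its operation DAG, so two tasks with disjoint surface vocabularies can still collide on structure. The `signature` construction below is hypothetical, invented for this sketch.

```python
# Hypothetical structural signature: reduce a plan DAG to a canonical shape
# that ignores all surface wording, keeping only operation types and edges.

def signature(plan):
    """plan: list of (step_id, op_type, [dependency step_ids])."""
    # Map step ids to positional indices so arbitrary names don't leak through.
    index = {step_id: i for i, (step_id, _, _) in enumerate(plan)}
    nodes = tuple(op for _, op, _ in plan)
    edges = tuple(sorted((index[d], i)
                         for i, (_, _, deps) in enumerate(plan)
                         for d in deps))
    return (nodes, edges)

# Two toy plans with different step names but identical operational structure:
# two independent lookups feeding a join.
kg_plan = [("a", "lookup", []), ("b", "lookup", []), ("c", "join", ["a", "b"])]
code_plan = [("x", "lookup", []), ("y", "lookup", []), ("z", "join", ["x", "y"])]
```

Under this construction the two plans produce identical signatures despite sharing no step names, which is the kind of match the premise requires the retrieval layer to detect reliably.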

What would settle it

A controlled test set of tasks with deliberately matched procedural structures but completely different wording and entities: if the system showed no accuracy improvement over the no-memory baseline there, the load-bearing premise would fail.

read the original abstract

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces APEX-EM, a non-parametric online learning framework for LLM-based autonomous agents that accumulates, retrieves, and reuses structured procedural-episodic experience traces without modifying model weights. It defines a structured experience representation (planning steps, artifacts, iteration history with error analysis, and quality scores), a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional rewards, and a dual-outcome Experience Memory using hybrid retrieval (semantic search, structural signature matching, and plan DAG traversal). Evaluations on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam report large gains, including 89.6% accuracy on KGQAGen-10k versus 41.3% without memory and surpassing an oracle-retrieval bound of 84.9%.

Significance. If the results hold after clarification of baselines, the work demonstrates a practical route to persistent procedural memory in frozen-backbone agents, with potential for cross-domain transfer based on operational structure rather than lexical overlap. The emphasis on full traces (including negative examples with error annotations) and task-dependent ablation insights are strengths that could inform future non-parametric agent designs.

major comments (1)
  1. [Abstract] Abstract: the reported 89.6% accuracy on KGQAGen-10k exceeds the stated oracle-retrieval upper bound of 84.9%. An oracle should encode perfect retrieval of the best available experience; exceeding it implies either an inconsistency in oracle construction or that APEX-EM components (PRGII iterations, Task Verifiers, dual-outcome memory, or structured error annotations) are unavailable to the oracle. The manuscript must explicitly define the oracle's scope, retrieval limits, and access to the same traces used by the full system.
minor comments (1)
  1. The abstract notes that backbone differences with MemRL were 'controlled for in our analysis' on BigCodeBench; the full text should provide the specific controls (e.g., model versions, prompt templates, or evaluation protocol) to allow replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying the need to clarify the oracle-retrieval bound relative to APEX-EM performance. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 89.6% accuracy on KGQAGen-10k exceeds the stated oracle-retrieval upper bound of 84.9%. An oracle should encode perfect retrieval of the best available experience; exceeding it implies either an inconsistency in oracle construction or that APEX-EM components (PRGII iterations, Task Verifiers, dual-outcome memory, or structured error annotations) are unavailable to the oracle. The manuscript must explicitly define the oracle's scope, retrieval limits, and access to the same traces used by the full system.

    Authors: We agree that explicit definition is required. The oracle-retrieval bound (84.9%) is computed by supplying the model with a single perfectly retrieved best-matching experience trace under a non-iterative, single-shot generation protocol that excludes the PRGII workflow, Task Verifier multi-dimensional rewards, and iterative error-analysis loop. APEX-EM, by contrast, inserts the same retrieved traces into the full Plan-Retrieve-Generate-Iterate-Ingest cycle, enabling additional gains from structured error annotations and re-generation. This architectural difference explains the observed exceedance. We will revise the abstract, Section 3 (PRGII workflow), and Section 4 (evaluation) to state the oracle's exact scope, retrieval limits, and lack of access to iterative components. revision: yes
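The rebuttal's explanation can be illustrated with a toy protocol comparison. The numbers below are invented for the sketch: if the oracle gets a perfect example but only one generation attempt, while the full loop gets verifier feedback and retries, iteration can exceed the single-shot bound even with weaker per-attempt odds (treating attempts as independent, which is a simplification).

```python
# Toy illustration (invented probabilities): why an iterative pipeline can
# exceed a single-shot oracle-retrieval bound. The oracle gets a perfect
# retrieved example but one attempt; the full loop gets retries with feedback.

def single_shot(p_success):
    return p_success

def iterative(p_per_attempt, max_iters):
    # Probability that at least one of max_iters independent attempts succeeds.
    return 1 - (1 - p_per_attempt) ** max_iters

oracle = single_shot(0.849)      # one attempt with a perfect example
full_loop = iterative(0.7, 3)    # weaker per-attempt odds, but three tries
assert full_loop > oracle
```

This does not validate the paper's specific numbers; it only shows the claimed exceedance is arithmetically possible once the oracle and the full system run under different protocols.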

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential reductions

full rationale

The paper describes an empirical non-parametric framework (APEX-EM) with structured experience replay, PRGII workflow, and hybrid retrieval, evaluated via direct accuracy measurements on KGQAGen-10k (89.6% vs baselines), BigCodeBench, and HLE. No equations, parameter fittings, uniqueness theorems, or ansatzes appear in the provided text. All claims reduce to observable benchmark outcomes rather than predictions that collapse to inputs by construction. Self-contained against external benchmarks with no load-bearing self-citations or definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that Task Verifiers supply reliable multi-dimensional signals and that hybrid retrieval can surface structurally analogous experiences. The main invented element is the specific structured trace format and dual-outcome memory; no explicit free parameters are described in the abstract.

axioms (1)
  • domain assumption Task Verifiers provide reliable multi-dimensional reward signals that support effective iteration.
    Stated as part of the PRGII workflow enabling refinement without weight updates.
invented entities (1)
  • Structured procedural-episodic experience representation · no independent evidence
    purpose: To encode planning steps, artifacts, iteration history with error analysis, and quality scores for retrieval and reuse.
    New representation introduced to enable the memory system; no independent external validation provided.

pith-pipeline@v0.9.0 · 5670 in / 1473 out tokens · 95523 ms · 2026-05-14T00:43:56.579981+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.