Recognition: 3 theorem links
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Pith reviewed 2026-05-12 16:40 UTC · model grok-4.3
The pith
Treating contexts as evolving playbooks through generation, reflection, and curation lets LLMs improve their own performance on agent and reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ACE treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. This prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines while significantly reducing adaptation latency and rollout cost. Notably, ACE can adapt effectively without labeled supervision, instead leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model.
What carries the argument
The modular generation-reflection-curation process that turns contexts into accumulating playbooks.
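The loop this describes can be sketched in outline. Everything below is illustrative scaffolding under assumptions, not the authors' implementation: the class and function names, the dictionary-of-bullets playbook representation, and the string-match dedup rule are all hypothetical.

```python
# Hypothetical sketch of a generate-reflect-curate loop; names and the
# playbook representation are illustrative, not taken from the ACE paper.
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Context as an append-and-refine store of strategy bullets."""
    bullets: dict = field(default_factory=dict)
    _next_id: int = 0

    def add(self, text: str) -> int:
        self.bullets[self._next_id] = text
        self._next_id += 1
        return self._next_id - 1

    def render(self) -> str:
        return "\n".join(f"[{i}] {b}" for i, b in self.bullets.items())


def generate(task: str, playbook: Playbook) -> dict:
    """Generator: attempt the task with the current playbook as context."""
    # Stand-in for an LLM rollout; success here is a toy rule.
    return {"task": task, "output": f"attempt({task})", "success": task != "hard"}


def reflect(trajectory: dict) -> list:
    """Reflector: turn raw execution feedback into candidate lessons.

    Note there is no labeled supervision here: only the execution
    outcome carried in the trajectory is used.
    """
    verdict = "worked" if trajectory["success"] else "failed"
    return [f"On '{trajectory['task']}', the approach {verdict}; adjust accordingly."]


def curate(playbook: Playbook, lessons: list) -> Playbook:
    """Curator: apply incremental delta updates, never a full rewrite."""
    for lesson in lessons:
        if lesson not in playbook.bullets.values():  # dedupe; keep old bullets
            playbook.add(lesson)
    return playbook


playbook = Playbook()
for task in ["easy", "hard", "easy"]:
    trajectory = generate(task, playbook)
    playbook = curate(playbook, reflect(trajectory))

print(playbook.render())
```

The design choice mirrored here is that the curator appends and refines individual bullets rather than rewriting the whole context, which is what the paper credits for avoiding context collapse under repeated updates.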
If this is right
- Contexts can be refined both as one-time system prompts and as ongoing agent memory stores.
- Adaptation works from execution outcomes alone, removing the need for curated training examples.
- Lower latency and rollout cost accompany the accuracy improvements on agent and finance tasks.
- Smaller open-source models reach parity with larger production agents on hard splits of agent benchmarks.
Where Pith is reading between the lines
- The playbook structure may support multi-session tasks where strategies must carry over days or weeks of interaction.
- Similar incremental curation could reduce the frequency of full model retraining in deployed applications.
- If the reflection step generalizes, the method might extend to domains where feedback is noisier than in current benchmarks.
Load-bearing premise
The generation, reflection, and curation steps can be executed without introducing biases or overhead that cancel out the gains in accuracy and speed.
What would settle it
A controlled run on a long-horizon task where repeated playbook updates cause loss of specific facts or where final performance falls below the no-update baseline.
read the original abstract
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation: modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the ACE (Agentic Context Engineering) framework, which treats contexts as evolving playbooks updated through a modular generation-reflection-curation process to mitigate brevity bias and context collapse in LLM applications. It reports empirical gains of +10.6% on agent benchmarks and +8.6% on finance tasks, reduced adaptation latency and rollout costs, effective unsupervised adaptation via natural execution feedback, and competitive AppWorld leaderboard performance matching or exceeding a top production agent on key splits despite using a smaller open-source model.
Significance. If the reported gains hold under rigorous scrutiny, the work would be significant for advancing context-based self-improvement in LLMs, offering a scalable alternative to fine-tuning that preserves detailed knowledge and leverages long-context capabilities. The unsupervised adaptation aspect and efficiency claims could influence agent design and domain-specific reasoning systems.
major comments (2)
- [Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.
- [Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.
minor comments (1)
- [Abstract] Abstract: The phrase 'significantly reducing' is used without any quantitative measure of the latency or cost reductions, which reduces clarity on the magnitude of the efficiency benefit.
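The accounting the referee asks for can be illustrated with a toy cost model. Every number below is a placeholder chosen for the example, not a measurement from the paper, and the function names are hypothetical:

```python
# Toy cost model for the net-overhead question: does the token cost of the
# three-stage pipeline exceed what it saves relative to a baseline?
# All counts are hypothetical placeholders, not figures from the paper.
def stage_cost(calls: int, tokens_per_call: int) -> int:
    """Total tokens consumed by one stage across an adaptation run."""
    return calls * tokens_per_call


def net_overhead(generation, reflection, curation, baseline_tokens):
    """Positive result: the pipeline costs more tokens; negative: it saves."""
    ace_total = sum(stage_cost(calls, toks)
                    for calls, toks in (generation, reflection, curation))
    return ace_total - baseline_tokens


# (calls, tokens_per_call) per stage -- illustrative values only
generation, reflection, curation = (10, 2000), (10, 800), (10, 400)
baseline = 10 * 3000  # e.g. ten monolithic context rewrites at 3000 tokens each

print(net_overhead(generation, reflection, curation, baseline))  # prints 2000
```

A per-stage table of exactly these quantities (calls, tokens per call, plus wall-clock time) measured against each baseline is what would let readers evaluate the "significantly reducing" claim.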
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas for improving clarity and rigor in presenting the ACE framework's efficiency and empirical results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.
read point-by-point responses
-
Referee: [Abstract] Abstract: The efficiency claims of 'significantly reducing adaptation latency and rollout cost' rest on an unverified assumption that the three-stage modular process adds negligible net overhead. No breakdown of per-stage token counts, wall-clock time, or total inference cost (including all generation, reflection, and curation LLM calls) is provided, so the net savings relative to baselines cannot be evaluated.
Authors: We agree that the abstract's efficiency claims would be stronger with explicit supporting data. The current manuscript does not include a per-stage breakdown of token counts, wall-clock times, or aggregate inference costs across the generation-reflection-curation pipeline. In the revised version, we will add a dedicated analysis (likely in Section 4 or an appendix) reporting these metrics for ACE versus baselines, including all LLM calls, to demonstrate net savings. This data was collected during our experiments and can be presented without changing the core findings. revision: yes
-
Referee: [Experiments] Experiments (implied by quantitative claims): The reported performance improvements (+10.6% on agents, +8.6% on finance, AppWorld results) lack any description of baselines, experimental setup, statistical tests, number of runs, variance, or implementation specifics. This absence makes the data-to-claim connection for the central empirical assertions impossible to assess from the manuscript.
Authors: We acknowledge that the manuscript would benefit from expanded methodological transparency to allow full assessment of the reported gains. While the full text describes the benchmarks, key baselines (e.g., standard prompting, iterative rewriting methods, and production agents), and evaluation protocols, we will revise the Experiments section to explicitly detail: the complete list of baselines with implementation references, number of runs (with seeds), statistical tests (e.g., significance levels and variance), standard deviations, and implementation specifics such as model versions, hyperparameters, and prompt structures. This will strengthen reproducibility and the link between data and claims. revision: yes
Circularity Check
No circularity: empirical framework evaluated on external benchmarks
full rationale
The paper presents ACE as a modular generation-reflection-curation process for evolving contexts and reports performance gains (+10.6% on agents, +8.6% on finance) plus efficiency improvements solely through comparisons to external baselines and leaderboards such as AppWorld. No derivation chain, equations, fitted parameters, or first-principles results are claimed; the central claims rest on standard empirical evaluation rather than any self-definition, renamed known result, or self-citation that reduces the outcome to the framework's own inputs. The absence of mathematical modeling or predictive steps derived from the method itself makes the reported results independent of internal circular construction.
Axiom & Free-Parameter Ledger
invented entities (1)
- ACE framework: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models.
-
IndisputableMonolith.Foundation.LawOfExistence.defect_zero_iff_one (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
SkillEvolver: Skill Learning as a Meta-Skill
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
-
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
-
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
-
Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV
A joint optimization approach using SOCP for UAV trajectories, DRL-LLM for resource scheduling, and LP for offloading achieves higher task success rates and system efficiency than multi-agent RL baselines in simulated...
-
AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
AgenticRecTune deploys five LLM agents (Actor, Critic, Insight, Skill, Online) and a self-evolving Skillhub to handle end-to-end configuration optimization for multi-stage recommendation systems.
-
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
-
Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Orchestrating one 8B model in three roles at inference time doubles task completion on AppWorld from 5.4% to 8.9%, surpassing a 33B baseline.
-
How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM's Residual Role in a Planning Agent
Declarative planning in the harness accounts for the bulk of performance (+24.1pp win rate) while the LLM activates on only 4.3% of turns with bounded effect.
-
A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems
A multi-agent generate-validate-revise framework reduces failures in realism and authenticity for LLM-personalized math problems, with one iteration helping and different strategies varying by criterion.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
AgenticRecTune: Multi-Agent with Self-Evolving Skillhub for Recommendation System Optimization
AgenticRecTune deploys Actor, Critic, Insight, Skill, and Online agents plus a self-evolving Skillhub to propose, filter, test, and learn from recommendation system configurations using Gemini LLMs.
-
Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants
Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.
discussion (0)