pith. machine review for the scientific record.

arxiv: 2602.02474 · v1 · submitted 2026-02-02 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-12 09:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords LLM agents · memory management · skill learning · self-evolving agents · long context · agent memory
0 comments

The pith

MemSkill enables LLM agents to learn and evolve their memory operations as reusable skills rather than using fixed hand-designed rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that memory management in LLM agents can be improved by treating extraction, consolidation, and pruning as learnable skills that a system can select and refine automatically. Traditional approaches use static procedures that embed human assumptions about what to remember, which fail to adapt when interaction patterns vary or histories grow long. By introducing a controller to choose skills, an executor to apply them, and a designer to analyze failures and propose new skills, MemSkill creates a self-improving loop. This matters because better memory handling could allow agents to sustain accurate recall over extended tasks without constant human redesign of the memory system.

Core claim

MemSkill reframes memory operations as learnable and evolvable skills. It uses a controller to select relevant skills from a growing set, an LLM-based executor to generate skill-guided memories from interaction traces, and a designer that periodically examines cases where skills produce incorrect or incomplete memories to propose refinements or entirely new skills, thereby improving both the selection policy and the skill collection itself over time.

What carries the argument

The closed-loop architecture: a controller that selects skills, an LLM executor that produces skill-guided memories, and a designer that evolves the skill set from hard cases.
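As a sketch under assumptions (the paper gives no code; its controller is learned and its executor and designer are LLM calls, all replaced here by trivial stand-ins), the closed loop can be outlined as:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    instruction: str  # prompt template the executor would follow

@dataclass
class MemSkillLoop:
    skills: list[Skill] = field(default_factory=list)
    hard_cases: list[dict] = field(default_factory=list)

    def controller_select(self, trace: str, k: int = 2) -> list[Skill]:
        # Stand-in for the learned controller: rank skills by keyword
        # overlap between the skill name and the interaction trace.
        return sorted(
            self.skills,
            key=lambda s: -sum(w in trace.lower() for w in s.name.split("_")),
        )[:k]

    def executor_apply(self, skills: list[Skill], trace: str) -> list[str]:
        # Stand-in for the LLM executor: one skill-guided memory per skill.
        return [f"[{s.name}] {trace}" for s in skills]

    def designer_evolve(self) -> int:
        # Stand-in for the designer: add one new skill per distinct
        # failure pattern found in the accumulated hard cases.
        patterns = {c["pattern"] for c in self.hard_cases}
        for p in sorted(patterns):
            self.skills.append(Skill(f"handle_{p}", f"Capture {p} details."))
        self.hard_cases.clear()
        return len(patterns)
```

`MemSkillLoop`, the keyword heuristic, and the one-skill-per-pattern rule are all invented for illustration; the paper's components are prompt- and learning-based, not hand-coded like this.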

If this is right

  • Task performance increases on benchmarks involving long contexts and memory demands such as LoCoMo and HotpotQA.
  • The approach generalizes across different environments including navigation tasks like ALFWorld.
  • Skills become more specialized and effective as the designer adds refinements from observed failures.
  • Overall memory quality improves without relying on manually crafted rules for each new scenario.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such self-evolving memory could allow agents to develop strategies tailored to specific user interaction styles over time.
  • Similar closed-loop evolution might be applied to other agent modules like planning or tool use.
  • Reduced need for initial hand-design could make deploying agents in new domains faster and less labor-intensive.

Load-bearing premise

The designer module can correctly identify hard cases and generate useful skill improvements that enhance memory without adding instability or errors.

What would settle it

Running the system on a long-horizon task and finding that evolved skills lead to no gain or worse performance compared to the initial set would falsify the central claim.

read the original abstract

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills: structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MemSkill, a closed-loop framework for LLM agent memory management that reframes static memory operations as learnable, evolvable skills. It consists of a controller that selects relevant skills, an executor that applies them to produce memories from interaction traces, and a designer that periodically reviews failures to propose skill refinements or new skills. The approach is evaluated on LoCoMo, LongMemEval, HotpotQA, and ALFWorld, claiming improved task performance over baselines and good generalization.

Significance. If the self-evolution claims hold with proper validation, this could meaningfully advance adaptive memory systems for LLM agents by reducing reliance on hand-designed priors and enabling ongoing improvement. The work highlights a promising direction for self-improving agents, though its impact depends on demonstrating that the designer module contributes measurably without instability.

major comments (2)
  1. [Experiments] Experiments section (results on LoCoMo, LongMemEval, HotpotQA, ALFWorld): The reported performance gains are not accompanied by ablations that isolate the designer's contribution (e.g., fixed initial skill set vs. evolved skills), making it impossible to attribute improvements specifically to the self-evolution loop rather than the initial controller-executor setup or prompting.
  2. [Method (designer module)] Method section describing the designer: The mechanism for reviewing hard cases, proposing refinements/new skills, and validating that these changes improve memory quality without introducing errors or instability is not detailed with concrete procedures, validation steps, or failure modes; this is load-bearing for the closed-loop claim but lacks the specificity needed to assess reliability.
minor comments (1)
  1. [Abstract] Abstract: The claim of generalization 'across settings' would benefit from a brief clarification of what 'settings' refers to (e.g., task types, history lengths) to avoid ambiguity.
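Major comment 1 asks for an ablation isolating the designer. A minimal sketch of how the requested deltas would be tabulated; every score below is an invented placeholder, not a result from the paper:

```python
# Hypothetical scores: benchmark -> (full MemSkill, designer disabled).
# All numbers are placeholders for illustration only.
scores = {
    "LoCoMo":      (62.4, 58.1),
    "LongMemEval": (55.0, 53.2),
    "HotpotQA":    (48.7, 47.9),
    "ALFWorld":    (71.3, 66.0),
}

def designer_delta(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Incremental benefit attributable to the closed-loop designer:
    full system minus the fixed-initial-skill-set condition."""
    return {b: round(full - frozen, 1) for b, (full, frozen) in scores.items()}
```

The point of the ablation is that only these per-benchmark differences, not the absolute scores, speak to whether the self-evolution loop itself earns its keep.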

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential of MemSkill to advance adaptive memory systems for LLM agents. We appreciate the emphasis on rigorous validation of the self-evolution claims. Below we address the major comments point by point, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (results on LoCoMo, LongMemEval, HotpotQA, ALFWorld): The reported performance gains are not accompanied by ablations that isolate the designer's contribution (e.g., fixed initial skill set vs. evolved skills), making it impossible to attribute improvements specifically to the self-evolution loop rather than the initial controller-executor setup or prompting.

    Authors: We agree that isolating the designer's contribution is essential for substantiating the self-evolution claims. The current manuscript includes comparisons against strong static-memory baselines and reports analyses of skill evolution, but it does not contain an explicit ablation holding the initial skill set fixed while disabling the designer. We will add this ablation in the revised Experiments section, reporting performance deltas on all four benchmarks (LoCoMo, LongMemEval, HotpotQA, ALFWorld) to quantify the incremental benefit attributable to the closed-loop designer. This will allow readers to distinguish gains from the initial controller-executor design versus ongoing skill evolution. revision: yes

  2. Referee: [Method (designer module)] Method section describing the designer: The mechanism for reviewing hard cases, proposing refinements/new skills, and validating that these changes improve memory quality without introducing errors or instability is not detailed with concrete procedures, validation steps, or failure modes; this is load-bearing for the closed-loop claim but lacks the specificity needed to assess reliability.

    Authors: We acknowledge that the designer module description in the current manuscript is high-level and requires greater specificity to support the closed-loop claim. In the revised Method section we will expand the description to include: (1) the exact procedure for identifying hard cases (e.g., failure detection via task outcome and memory-consistency checks), (2) the prompting template and decision criteria used by the designer to propose refinements versus new skills, (3) the validation protocol (testing proposed skills on a held-out set of interaction traces before acceptance), and (4) discussion of failure modes such as skill redundancy, error propagation, or instability. We will also add pseudocode and a concrete example of one skill evolution cycle to make the reliability of the loop assessable. revision: yes
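The four revision items above can be sketched as a single designer cycle. Here `propose` and `validate` stand in for the LLM call and the held-out evaluation the rebuttal describes; nothing below is from the paper's actual implementation:

```python
def group_by_pattern(hard_cases: list[dict]) -> dict[str, list[dict]]:
    # Step (1): cluster failures into named patterns.
    groups: dict[str, list[dict]] = {}
    for case in hard_cases:
        groups.setdefault(case["pattern"], []).append(case)
    return groups

def evolve_one_cycle(skills, hard_cases, propose, validate, heldout):
    """One closed-loop designer cycle:
    (1) hard cases are grouped into failure patterns,
    (2) the designer proposes a refinement or new skill per pattern,
    (3) each candidate is kept only if it beats the status quo on
        held-out interaction traces,
    (4) rejected candidates are returned so instability can be audited."""
    accepted, rejected = [], []
    for pattern, cases in group_by_pattern(hard_cases).items():
        candidate = propose(pattern, cases)
        if validate(skills + [candidate], heldout) > validate(skills, heldout):
            accepted.append(candidate)
        else:
            rejected.append(candidate)
    return skills + accepted, rejected
```

The acceptance test in step (3) is the load-bearing part: without it, the loop has no guard against the error propagation and redundancy the referee flags.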

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

full rationale

The paper describes MemSkill as a procedural system with controller, executor, and designer components that evolve memory skills through review of hard cases. Claims rest on experimental results across LoCoMo, LongMemEval, HotpotQA, and ALFWorld showing gains over baselines, with no equations, fitted parameters, or derivations presented. No self-citations appear as load-bearing justifications for uniqueness or ansatzes, and the closed-loop improvement is not shown to reduce to its own inputs by construction. The approach is validated against external task benchmarks rather than being internally tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available; no explicit free parameters, axioms, or invented entities beyond high-level components can be extracted or verified.

invented entities (1)
  • memory skills no independent evidence
    purpose: learnable and evolvable routines for extracting, consolidating, and pruning information
    Positioned as the core innovation replacing hand-designed operations.

pith-pipeline@v0.9.0 · 5558 in / 1185 out tokens · 54638 ms · 2026-05-12T09:05:35.313883+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself.

  • IndisputableMonolith.Foundation.LedgerForcing conservation_from_balance unclear

    Relation between the paper passage and the cited Recognition theorem.

    Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...

  4. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  5. OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

  6. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  7. TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

  8. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 7.0

    SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.

  9. Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

    cs.LG 2026-05 unverdicted novelty 7.0

    CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

  10. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...

  11. Belief Memory: Agent Memory Under Partial Observability

    cs.AI 2026-05 unverdicted novelty 7.0

    BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.

  12. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

M*: Every Task Deserves Its Own Memory Harness

    cs.PL 2026-04 unverdicted novelty 7.0

    M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.

  15. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  16. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  17. Skill-R1: Agent Skill Evolution via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.

  18. Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

    cs.AI 2026-05 unverdicted novelty 6.0

    Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...

  19. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  20. Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

    cs.CL 2026-05 unverdicted novelty 6.0

    GoSkills converts flat skill lists into role-labeled execution contexts via anchor-centered groups and graph expansion, preserving coverage and improving rewards on SkillsBench and ALFWorld under small skill budgets.

  21. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  22. When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents

    cs.LG 2026-04 unverdicted novelty 6.0

    External memory does not eliminate continual learning challenges in LLM agents but reshapes them into issues of memory representation and retrieval design, with abstract memories aiding transfer while organization cho...

  23. AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

    cs.CL 2026-04 unverdicted novelty 6.0

    AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

  24. Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

  25. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  26. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  27. Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

    cs.AI 2026-05 unverdicted novelty 5.0

    Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...

  28. PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

    cs.CL 2026-05 unverdicted novelty 5.0

    An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.

  29. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  30. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  31. Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

    cs.AI 2026-04 unverdicted novelty 5.0

    Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...

  32. Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

    cs.AI 2026-04 unverdicted novelty 5.0

    Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.

  33. From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

    cs.SE 2026-04 unverdicted novelty 5.0

    Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...

  34. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  35. MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought

    cs.MA 2026-04 unverdicted novelty 5.0

    MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.

  36. Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

    cs.MA 2026-03 unverdicted novelty 5.0

    LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.

  37. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  38. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 4.0

    Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

  39. Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence

    cs.AI 2026-05 unverdicted novelty 3.0

    Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 33 Pith papers

  1. [1]

    **Memory Storage**: The system applies memory management skills to decide what information to store from the text chunk

  2. [2]

    **Memory Retrieval**: At question time, it retrieves the most relevant memories by semantic similarity

  3. [3]

    **Answer Generation**: An LLM answers using the retrieved memories. Failures can occur at any stage: - **Storage failure**: Important information was never stored (skill missing or misapplied) - **Retrieval failure**: Relevant memory exists but was not retrieved (embedding mismatch) - **Memory quality failure**: Memory exists but is too vague or incomplet...

  4. [4]

    For each case, check whether the retrieved memories contain the answer or the needed evidence

  5. [5]

    If missing, decide whether it was never stored (storage failure) or stored but too weak (memory quality failure)

  6. [6]

    If the answer is present but not retrieved, label it retrieval failure and avoid changing skills unless the pattern repeats

  7. [7]

    Group cases into patterns tied to information types, entities, temporal details, or constraints

  8. [8]

    For each pattern, propose a concrete skill change: add a new skill or refine an existing one to capture missing details

  9. [9]

    failure patterns

Provide up to {max changes} recommendations total (use fewer if only one change is justified). ## Output Format Provide your analysis as JSON: { "failure patterns": [ { "pattern name": "[descriptive name for this failure pattern]", "affected cases": [list of case numbers, e.g., 1, 3, 5], "root cause": "[storage failure | retrieval failure | memory quality failu...

  10. [10]

    instruction template MUST be a skill-style guide and MUST NOT include context placeholders (the executor injects the text chunk and retrieved memories)

  11. [11]

    instruction template MUST clearly state purpose, when to use, and constraints

  12. [12]

    instruction template MUST specify the allowed action type (INSERT or UPDATE only)

  13. [13]

    insert” or “update

    For new operations, ‘update type‘ must be either “insert” or “update” (delete and noop operations are not evolved at this time)

  14. [14]

    insert” or “update

    Only propose operations with update type “insert” or “update”

  15. [15]

    ENHANCED

    Avoid labels like "ENHANCED", "ADVANCED", or other marketing adjectives in descriptions or templates; keep phrasing neutral and task-specific

  16. [16]

    Do NOT embed output blocks; the executor handles output formatting and can apply the skill multiple times

  17. [17]

    The number of changes in the list MUST be less than max changes

  18. [18]

    name”: “extract personal preferences

    Do NOT modify the same operation more than once in a single response, and do NOT refine an operation you add in the same response ## Example of a Well-Designed Insert Operation { “name”: “extract personal preferences”, “description”: “Memory management skill for capturing personal preferences and habits mentioned in the text chunk.”, “update type”: “inser...

  19. [19]

    Correctness: - Is the model answer factually consistent with ANY of the correct answers? - Does it avoid contradictions or introducing false information?

  20. [20]

    Relevance: - Does the answer address the question directly without unnecessary content?

  21. [21]

    explanation

    Completeness: - Does the answer include all essential information needed to fully answer the question? - Partial answers are allowed but should receive lower scores. Scoring Rules: - Score = 1.0 if the answer is fully correct. - Score = 0.5 if the answer is partially correct but incomplete or slightly inaccurate. - Score = 0.0 if the answer is incorrect, ...
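Fragments [3] to [6] above describe a failure triage and [19] to [21] a three-level grading rubric. Read together, they suggest logic along these lines; the boolean inputs stand in for LLM judgments over the actual traces, and this is a reconstruction from the fragments, not the paper's code:

```python
def classify_failure(stored: bool, retrieved: bool, specific_enough: bool) -> str:
    """Triage order from the extracted designer prompt: storage first,
    then retrieval, then memory quality."""
    if not stored:
        return "storage failure"         # information was never written to memory
    if not retrieved:
        return "retrieval failure"       # memory exists but lookup missed it
    if not specific_enough:
        return "memory quality failure"  # memory retrieved but too vague to answer
    return "no failure"

def grade(correct: bool, complete: bool) -> float:
    """Scoring rules from the extracted judge prompt: 1.0 fully correct,
    0.5 partially correct or incomplete, 0.0 incorrect."""
    if not correct:
        return 0.0
    return 1.0 if complete else 0.5
```

Note that fragment [6] cautions against changing skills on a lone retrieval failure; the classification above feeds the designer but does not by itself trigger a skill edit.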