Agent Workflow Memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig · 2024 · cs.CL · arXiv 2409.07429

Canonical reference. 86% of citing Pith papers cite this work as background.

65 Pith papers citing it

abstract

Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
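The online setting described in the abstract can be pictured as a loop: solve a query with the current memory, induce workflows from the resulting trajectory, and make them available to later queries. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `Workflow`, `WorkflowMemory`, `run_online`, and the `agent`, `judge`, and `induce` callables are hypothetical. Gating induction on a success judgment is an assumption of this sketch, and the selective retrieval mentioned in the abstract is omitted (the whole memory is passed to the agent).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """A reusable routine: a name plus an ordered list of abstract steps."""
    name: str
    steps: List[str]


@dataclass
class WorkflowMemory:
    """Hypothetical store of induced workflows (not the paper's data structure)."""
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, wf: Workflow) -> None:
        # Skip workflows whose name is already stored, to avoid duplicates.
        if all(wf.name != w.name for w in self.workflows):
            self.workflows.append(wf)

    def as_prompt(self) -> str:
        # Render every stored workflow as one line of guidance text.
        return "\n".join(
            f"{w.name}: " + " -> ".join(w.steps) for w in self.workflows
        )


def run_online(
    queries: List[str],
    agent: Callable[[str, str], List[str]],
    judge: Callable[[str, List[str]], bool],
    induce: Callable[[str, List[str]], List[Workflow]],
    memory: WorkflowMemory,
) -> List[bool]:
    """Process queries in order, growing the memory on the fly."""
    outcomes = []
    for query in queries:
        # The agent sees the workflows induced from earlier queries.
        trajectory = agent(query, memory.as_prompt())
        success = judge(query, trajectory)
        if success:
            # Assumption: only trajectories judged successful feed induction.
            for wf in induce(query, trajectory):
                memory.add(wf)
        outcomes.append(success)
    return outcomes


# Toy stand-ins for the agent, the success judge, and the induction step.
def toy_agent(query: str, hints: str) -> List[str]:
    return ["open site", f"search {query}", "click result"]


def toy_judge(query: str, trajectory: List[str]) -> bool:
    return len(trajectory) > 0


def toy_induce(query: str, trajectory: List[str]) -> List[Workflow]:
    return [Workflow("search_and_open",
                     ["open site", "search {term}", "click result"])]


memory = WorkflowMemory()
outcomes = run_online(["red shoes", "blue hat"],
                      toy_agent, toy_judge, toy_induce, memory)
```

In this toy run the second query is answered with the workflow induced from the first already in the prompt. An offline variant would call the same induction step on training examples before any test query is seen.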

citation-role summary

background 11 · method 2 · baseline 1

citation-polarity summary

background 12 · baseline 1 · use method 1

representative citing papers

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

cs.IR · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

MemSyco-Bench is a benchmark covering five tasks to evaluate memory-induced sycophancy in LLM agents, testing rejection of invalid memory, scope respect, conflict resolution, update tracking, and valid personalization.

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

DMV-Bench introduces the first interactive benchmark for multimodal-agent visual memory via incidental cue injection on product images, and DualMem, a parallel visual-verbal memory architecture, outperforms baselines across chain lengths 5-50 on two VLMs.

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

cs.AI · 2026-06-06 · unverdicted · novelty 7.0

PACE is a training-free anytime-valid commit gate using testing-by-betting e-processes that controls per-candidate false-commit probability for self-evolving agents and reduces spurious edits compared to greedy acceptance.

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

cs.CR · 2026-05-25 · unverdicted · novelty 7.0

CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

cs.AI · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.

EXG: Self-Evolving Agents with Experience Graphs

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop applications.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis

eess.SY · 2026-03-18 · unverdicted · novelty 7.0

PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

PEEU enables a 7B MLLM to reach 30.6% accuracy on GUI task planning by autonomous exploration and hindsight experience synthesis, outperforming a 32B model through stronger high-level OOD generalization.

MetaPS: Adaptive Programmatic Strategy Selection for Market Agents

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.

Recursive Self-Evolving Agents via Held-Out Selection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

SkillMigrator reduces LLM-action counts by 8-10% on WebArena and Mind2Web by transferring web skills via layout-matched transferable interaction patterns.

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

SEAGym turns existing benchmarks into multi-view evaluation sources for measuring reusable improvements in LLM agent harnesses, revealing complementary signals missed by single-curve or isolated-task tests.

citing papers explorer

Showing 50 of 65 citing papers.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows cs.CL · 2026-05-13 · unverdicted · none · ref 1 · internal anchor
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 24 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
MemSyco-Bench: Benchmarking Sycophancy in Agent Memory cs.IR · 2026-07-01 · unverdicted · none · ref 31 · 2 links · internal anchor
MemSyco-Bench is a benchmark covering five tasks to evaluate memory-induced sycophancy in LLM agents, testing rejection of invalid memory, scope respect, conflict resolution, update tracking, and valid personalization.
DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection cs.CV · 2026-06-25 · unverdicted · none · ref 16 · internal anchor
DMV-Bench introduces the first interactive benchmark for multimodal-agent visual memory via incidental cue injection on product images, and DualMem, a parallel visual-verbal memory architecture, outperforms baselines across chain lengths 5-50 on two VLMs.
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory cs.AI · 2026-06-08 · unverdicted · none · ref 70 · internal anchor
SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents cs.AI · 2026-06-06 · unverdicted · none · ref 11 · internal anchor
PACE is a training-free anytime-valid commit gate using testing-by-betting e-processes that controls per-candidate false-commit probability for self-evolving agents and reduces spurious edits compared to greedy acceptance.
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale cs.AI · 2026-06-02 · unverdicted · none · ref 14 · internal anchor
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly cs.CR · 2026-05-25 · unverdicted · none · ref 66 · internal anchor
CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries cs.AI · 2026-05-19 · unverdicted · none · ref 12 · 2 links · internal anchor
The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.
EXG: Self-Evolving Agents with Experience Graphs cs.AI · 2026-05-18 · unverdicted · none · ref 28 · internal anchor
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents cs.CL · 2026-05-16 · unverdicted · none · ref 3 · internal anchor
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents cs.AI · 2026-05-13 · unverdicted · none · ref 44 · 2 links · internal anchor
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 60 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 20 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning cs.AI · 2026-04-10 · unverdicted · none · ref 10 · internal anchor
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop applications.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 129 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis eess.SY · 2026-03-18 · unverdicted · none · ref 28 · internal anchor
PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents cs.CR · 2024-10-03 · unverdicted · none · ref 139 · internal anchor
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning cs.CL · 2026-06-25 · unverdicted · none · ref 38 · internal anchor
PEEU enables a 7B MLLM to reach 30.6% accuracy on GUI task planning by autonomous exploration and hindsight experience synthesis, outperforming a 32B model through stronger high-level OOD generalization.
MetaPS: Adaptive Programmatic Strategy Selection for Market Agents cs.AI · 2026-06-21 · unverdicted · none · ref 73 · internal anchor
MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.
Recursive Self-Evolving Agents via Held-Out Selection cs.AI · 2026-06-17 · unverdicted · none · ref 18 · internal anchor
RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns cs.AI · 2026-06-16 · unverdicted · none · ref 6 · internal anchor
SkillMigrator reduces LLM-action counts by 8-10% on WebArena and Mind2Web by transferring web skills via layout-matched transferable interaction patterns.
SEAGym: An Evaluation Environment for Self-Evolving LLM Agents cs.AI · 2026-06-16 · unverdicted · none · ref 53 · internal anchor
SEAGym turns existing benchmarks into multi-view evaluation sources for measuring reusable improvements in LLM agent harnesses, revealing complementary signals missed by single-curve or isolated-task tests.
Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition cs.AI · 2026-06-05 · unverdicted · none · ref 2 · internal anchor
W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories cs.CL · 2026-05-31 · unverdicted · none · ref 61 · internal anchor
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search cs.AI · 2026-05-30 · unverdicted · none · ref 21 · internal anchor
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
MemPro: Agentic Memory Systems as Evolvable Programs cs.CL · 2026-05-30 · unverdicted · none · ref 12 · internal anchor
MemPro evolves the entire MCR pipeline as runnable programs via failure-guided refinement on a version tree and outperforms static baselines on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA.
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents cs.CL · 2026-05-29 · unverdicted · none · ref 48 · internal anchor
ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.
Rethinking Memory as Continuously Evolving Connectivity cs.CL · 2026-05-27 · unverdicted · none · ref 52 · internal anchor
FluxMem evolves memory as a heterogeneous graph via three refinement stages and reports consistent state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 25 · internal anchor
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents cs.CL · 2026-05-20 · unverdicted · none · ref 32 · internal anchor
Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 65 · internal anchor
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases cs.LG · 2026-05-10 · unverdicted · none · ref 30 · internal anchor
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning cs.AI · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
DRIVE disentangles reasoning and interaction skills for web agents via dual-level modeling and scene-aware coordination, reaching 52.8% success on WebArena tasks.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology cs.AI · 2026-04-19 · unverdicted · none · ref 39 · internal anchor
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
SkillDroid: Compile Once, Reuse Forever cs.HC · 2026-04-16 · conditional · none · ref 23 · internal anchor
SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 rounds while the stateless baseline drops to 44%.
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation cs.DC · 2026-04-08 · unverdicted · none · ref 7 · internal anchor
A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference from O(M x N) to amortized O(1) for repetitive web workflows.
Procedural Knowledge at Scale Improves Reasoning cs.CL · 2026-04-01 · unverdicted · none · ref 39 · internal anchor
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
Real-Time Procedural Learning From Experience for AI Agents cs.AI · 2025-11-27 · unverdicted · none · ref 16 · internal anchor
PRAXIS enables AI agents to acquire procedural knowledge in real time by indexing and retrieving state-action-result experiences, leading to better accuracy, reliability, and efficiency on web browsing benchmarks.
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents cs.CL · 2025-09-09 · unverdicted · none · ref 51 · internal anchor
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.
AgRefactor: Self-Evolving Agentic Workflow for HLS Compatibility and Performance cs.AI · 2026-06-29 · unverdicted · none · ref 16 · internal anchor
AgRefactor deploys a self-evolving multi-agent workflow that combines LLM rewrites with automated tools to convert software into HLS code, matching or beating baselines on long benchmarks and delivering 6.51x geometric mean speedup after optimization.
RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources cs.SE · 2026-06-28 · unverdicted · none · ref 11 · internal anchor
RESOURCE2SKILL converts multimodal human resources into a hierarchical Skill Wiki of executable agent skills, reporting +11.9 percentage point average gains over no-skill baselines across seven authoring domains.
SKILL-DISCO: Distilling and Compiling Agent Traces into Reusable Procedural Skills cs.AI · 2026-06-25 · unverdicted · none · ref 60 · internal anchor
SkillDisCo distills reusable PFSM subgraphs from successful agent traces and compiles them into callable procedural skills, improving success rates and reducing turns on ALFWorld and WebArena.
How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs cs.AI · 2026-06-18 · unverdicted · none · ref 7 · internal anchor
Hierarchically grouped demonstrations raise pass rates from 76.7% to 90.7% on 43 vague-description tasks while flat logs show smaller non-significant gains.
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution cs.LG · 2026-06-18 · unverdicted · none · ref 15 · internal anchor
MAA formalizes alignability and comparability conditions and uses differential signals, EMA accumulation, and semantic identity merging to enable cross-batch operation-level evidence accumulation, outperforming batch-level baselines in 14 of 16 settings while matching online methods.
EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation cs.AI · 2026-06-16 · unverdicted · none · ref 41 · internal anchor
EvolveNav adds an agentic rule memory with UCB retrieval and a memory-guided preflection module to enable continuous improvement in zero-shot object goal navigation, reporting a 10.1% success rate gain over baselines.
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application cs.CL · 2026-06-10 · unverdicted · none · ref 229 · internal anchor
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
Rethinking Continual Experience Internalization for Self-Evolving LLM Agents cs.CL · 2026-06-03 · unverdicted · none · ref 9 · internal anchor
Existing methods for turning LLM interaction experience into parametric skills collapse over multiple iterations; principle-level experience, step-wise injection, and off-policy teacher distillation yield more stable continual learning.
VikingMem: A Memory Base Management System for Stateful LLM-based Applications cs.AI · 2026-05-28 · unverdicted · none · ref 65 · internal anchor
VikingMem implements the Memory Base paradigm via event-centric extraction and entity updates on VikingDB with temporal compression, claiming up to 30% better retrieval effectiveness on long-term memory benchmarks.
