FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Agent Workflow Memory
Canonical reference. 85% of citing Pith papers cite this work as background.
abstract
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively, while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, cross-website, and cross-domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
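The online scenario described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Workflow`, `WorkflowMemory`, `induce_workflow`, and `run_online` are hypothetical, the induction step simply copies the trajectory as a stand-in for whatever induction procedure the paper uses, and gating induction on a `judge` callback is an assumption rather than something the abstract states.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A trajectory is modeled here as a list of action strings (an assumption).
Trajectory = List[str]


@dataclass
class Workflow:
    name: str
    steps: List[str]


@dataclass
class WorkflowMemory:
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        # Skip workflows whose steps are already stored, so repeated
        # routines do not inflate the memory.
        if all(w.steps != workflow.steps for w in self.workflows):
            self.workflows.append(workflow)

    def as_prompt(self) -> str:
        # Render stored workflows as text that can be added to the agent's context.
        return "\n".join(
            f"{w.name}: " + " -> ".join(w.steps) for w in self.workflows
        )


def induce_workflow(task: str, trajectory: Trajectory) -> Workflow:
    # Placeholder induction: reuse the trajectory verbatim.
    # A real system would abstract it into a reusable routine.
    return Workflow(name=task, steps=list(trajectory))


def run_online(
    tasks: List[str],
    agent: Callable[[str, str], Trajectory],
    judge: Callable[[str, Trajectory], bool],
    memory: WorkflowMemory,
) -> List[bool]:
    # Process test queries one at a time, growing the memory on the fly.
    results = []
    for task in tasks:
        trajectory = agent(task, memory.as_prompt())
        success = judge(task, trajectory)
        if success:
            # Assumption: only trajectories judged successful are induced.
            memory.add(induce_workflow(task, trajectory))
        results.append(success)
    return results
```

An offline variant under the same assumptions would call `induce_workflow` on training examples to fill the memory before any test query is processed, then run the agent with the memory held fixed.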
representative citing papers
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
DMV-Bench introduces the first interactive benchmark for multimodal-agent visual memory via incidental cue injection on product images, and DualMem, a parallel visual-verbal memory architecture, outperforms baselines across chain lengths 5-50 on two VLMs.
SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
PACE is a training-free anytime-valid commit gate using testing-by-betting e-processes that controls per-candidate false-commit probability for self-evolving agents and reduces spurious edits compared to greedy acceptance.
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
CyberEvolver introduces a four-layer self-evolving agent architecture with trace-to-diagnosis and population beam search that raises seed agent success rates by 13.6% on CTF, exploitation, and penetration tasks across four LLMs.
The paper diagnoses library drift in self-evolving LLM skill libraries and demonstrates a governance recipe raising pass@1 from 0.258 to 0.584 on MBPP+ hard-100.
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop applications.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.
W2S framework with RWSA decomposition converts heterogeneous traces into Skills and improves behavioral replay consistency by 10.5% over summarization baselines on 70 Skills.
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
MemPro evolves the entire MCR pipeline as runnable programs via failure-guided refinement on a version tree and outperforms static baselines on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA.
ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.
citing papers explorer
- What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
- Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
- Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
Proposes Knowledge Objects to externalize implicit AI knowledge for human verification, addressing a gap in current reliability methods.
- MemSyco-Bench: Benchmarking Sycophancy in Agent Memory