Agent Workflow Memory
20 Pith papers cite this work.
Citing papers
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable, high-quality configuration sets that outperform routing baselines with up to a 6.4x speedup.
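The core move is classic compiler-style design space exploration applied to workflow configuration: enumerate candidate per-stage settings once, score them offline, and keep only the non-dominated set for reuse at serving time. A minimal sketch of that loop, where the stage set, the scorer, and all quality/latency numbers are illustrative assumptions rather than FlowCompile's actual interface:

```python
from itertools import product

# Hypothetical per-stage design space: which model serves each workflow stage.
# All names and numbers below are illustrative, not FlowCompile's interface.
DESIGN_SPACE = {stage: ["small-llm", "large-llm"]
                for stage in ("plan", "execute", "verify")}
QUALITY = {"plan":    {"small-llm": 0.5, "large-llm": 1.0},
           "execute": {"small-llm": 0.8, "large-llm": 0.9},
           "verify":  {"small-llm": 0.7, "large-llm": 0.8}}
LATENCY = {"small-llm": 1.0, "large-llm": 3.0}

def evaluate(config):
    """Stand-in scorer; a real compiler would run the workflow on a dev set."""
    return (sum(QUALITY[stage][m] for stage, m in config.items()),
            sum(LATENCY[m] for m in config.values()))

def pareto_front(scored):
    """Keep configurations not dominated in (higher quality, lower latency)."""
    return [(cfg, ql) for cfg, ql in scored
            if not any(q2 >= ql[0] and l2 <= ql[1] and (q2, l2) != ql
                       for _, (q2, l2) in scored)]

# Compile-time exploration: enumerate and score once, reuse the set forever.
configs = [dict(zip(DESIGN_SPACE, choice))
           for choice in product(*DESIGN_SPACE.values())]
front = pareto_front([(cfg, evaluate(cfg)) for cfg in configs])
for cfg, (q, l) in front:
    print(list(cfg.values()), f"quality={q:.1f} latency={l:.1f}")
```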
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves pass rates of up to 77.8% by matching distillation or experience augmentation to task type across three desktop applications.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while showing that some skills are internalized and others remain external.
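The leave-one-skill-out estimate can be read as a straightforward ablation: re-evaluate the agent with each skill removed and treat the drop in success rate as that skill's marginal contribution. A toy sketch under that reading; the evaluator, the numbers, and the keep/prune rule are stand-ins, and the paper's three lifecycle operations are not reproduced here:

```python
def success_rate(active_skills):
    """Stand-in evaluator; a real system would roll the agent out on tasks."""
    value = {"search": 0.20, "navigate": 0.15, "summarize": 0.00}  # toy truth
    return 0.40 + sum(value[s] for s in active_skills)

skills = ["search", "navigate", "summarize"]
base = success_rate(skills)

# Leave-one-skill-out: marginal contribution = base rate minus ablated rate.
contribution = {s: base - success_rate([t for t in skills if t != s])
                for s in skills}

# One plausible lifecycle decision (threshold assumed): prune skills with
# near-zero marginal value, treating them as internalized by the policy.
for s, c in sorted(contribution.items(), key=lambda kv: -kv[1]):
    verdict = "keep (external)" if c > 0.05 else "prune (internalized)"
    print(f"{s}: marginal={c:+.2f} -> {verdict}")
```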
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
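The "admit only improvements" gate is essentially greedy hill-climbing over an executable knowledge base: propose a localized edit, rerun a deterministic verifier, and commit only on a strict score increase. A minimal sketch with the knowledge base reduced to a rule table; the rule format and test cases are assumptions, not Kintsugi's representation:

```python
import copy

# Toy "executable knowledge base": rules mapping observations to actions.
kb = {"door_locked": "push", "light_off": "wait"}

# Deterministic verification suite: (observation, expected action) cases.
CASES = [("door_locked", "unlock"), ("light_off", "flip_switch")]

def score(rules):
    """Fraction of verification cases the knowledge base gets right."""
    return sum(rules.get(obs) == want for obs, want in CASES) / len(CASES)

def gated_edit(rules, key, new_action):
    """Apply a localized edit only if the verified score strictly improves."""
    candidate = copy.deepcopy(rules)
    candidate[key] = new_action
    return (candidate, True) if score(candidate) > score(rules) else (rules, False)

# Diagnosis would come from an agent; here two candidate edits are hand-fed.
for key, action in [("door_locked", "unlock"), ("light_off", "wait_longer")]:
    kb, ok = gated_edit(kb, key, action)
    print(f"edit {key} -> {action}: {'admitted' if ok else 'rejected'} "
          f"(score now {score(kb):.2f})")
```

The first edit fixes a failing case and is admitted; the second leaves the score flat and is rejected, which is what keeps the knowledge base monotonically improving.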
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
SkillDroid: Compile Once, Reuse Forever
SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching an 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 rounds, while the stateless baseline drops to 44%.
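A matching cascade of this kind tries cheap reuse first and falls back to the LLM only when no compiled template fits, which is where the reported reduction in LLM calls would come from. A sketch of that control flow, using regex-parameterized templates as an assumed stand-in for SkillDroid's skill format:

```python
import re

# Compiled skill templates: a trigger pattern plus a parameterized action plan.
TEMPLATES = [
    (re.compile(r"send (?P<msg>.+) to (?P<user>\w+)"),
     lambda p: [f"open_chat({p['user']})", f"type({p['msg']})", "tap(send)"]),
    (re.compile(r"open settings"),
     lambda p: ["tap(menu)", "tap(settings)"]),
]

def llm_plan(task):
    """Fallback: ask the LLM to plan from scratch (stubbed here)."""
    return [f"llm_generated_step_for({task!r})"]

def resolve(task):
    """Cascade: try each compiled template; fall back to the LLM on a miss."""
    for pattern, expand in TEMPLATES:
        m = pattern.fullmatch(task)
        if m:
            return expand(m.groupdict()), "template"   # no LLM call needed
    return llm_plan(task), "llm"

for task in ["send hi to alice", "open settings", "rotate the screenshot"]:
    plan, route = resolve(task)
    print(f"{task!r} via {route}: {plan}")
```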
-
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation
A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference cost from O(M × N) to amortized O(1) for repetitive web workflows.
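The complexity claim follows from amortization: with M workflow steps and N repetitions, a step-by-step agent pays roughly M × N LLM calls, while compiling once to a blueprint pays M, so per-run inference cost approaches O(1) as N grows. A sketch of that compile/execute split with the LLM and browser stubbed out; the blueprint schema shown is an illustrative guess:

```python
import json

def llm_compile(task_description):
    """One-shot compilation: the LLM emits a full JSON blueprint (stubbed)."""
    return json.dumps([
        {"op": "goto",  "url": "https://example.com/form"},
        {"op": "fill",  "selector": "#name", "value": "{name}"},
        {"op": "click", "selector": "#submit"},
    ])

def execute(blueprint_json, params):
    """Deterministic replay: no LLM in the loop, just parameter substitution."""
    for step in json.loads(blueprint_json):
        concrete = {k: (v.format(**params) if isinstance(v, str) else v)
                    for k, v in step.items()}
        print("browser:", concrete)  # a real executor would drive the browser

blueprint = llm_compile("submit the contact form")   # paid once: O(M) reasoning
for name in ["alice", "bob", "carol"]:               # replayed N times: no LLM
    execute(blueprint, {"name": name})
```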
-
Procedural Knowledge at Scale Improves Reasoning
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
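The retrieval step can be pictured as nearest-neighbor lookup over subquestion keys, with the matched subroutine spliced into the model's reasoning as an in-thought hint. A toy sketch that substitutes bag-of-words cosine similarity for the paper's actual retriever and shrinks the 32-million-pair store to two entries:

```python
from collections import Counter
from math import sqrt

# Toy memory of (subquestion, subroutine) pairs; the paper's store has ~32M.
MEMORY = [
    ("how to find the gcd of two integers",
     "apply the Euclidean algorithm: gcd(a, b) = gcd(b, a mod b) until b = 0"),
    ("how to integrate by parts",
     "use the identity: integral of u dv = u*v - integral of v du"),
]

def similarity(a, b):
    """Cosine similarity over word counts; a stand-in for a dense retriever."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm

def with_in_thought_hint(question):
    """Retrieve the closest subroutine and splice it into the reasoning trace."""
    sub, routine = max(MEMORY, key=lambda m: similarity(question, m[0]))
    return f"<think>recalled for '{sub}': {routine}</think>\n{question}"

print(with_in_thought_hint("how to find the gcd of 48 and 18"))
```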
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
The paper proposes a systems-level data model that preserves typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems, improving long-term inspectability and maintainability.
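The four adjectives map naturally onto a record schema: a type tag, a stable address, a version counter, and explicit parent links, kept in an append-only log. A minimal sketch of such a store; the field names and API are my assumptions, not the paper's data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    address: str       # addressable: a stable identifier
    kind: str          # typed: e.g. "plan", "draft", "tool_output"
    version: int       # versioned: bumped on every rewrite of the address
    parents: tuple     # dependency-aware: addresses this was derived from
    payload: str

class ArtifactStore:
    def __init__(self):
        self._log = []  # append-only: old versions stay inspectable

    def put(self, address, kind, payload, parents=()):
        version = 1 + sum(a.address == address for a in self._log)
        artifact = Artifact(address, kind, version, tuple(parents), payload)
        self._log.append(artifact)
        return artifact

    def lineage(self, artifact):
        """Walk parent links to recover how an artifact was produced."""
        for addr in artifact.parents:
            parent = max((a for a in self._log if a.address == addr),
                         key=lambda a: a.version)
            yield parent
            yield from self.lineage(parent)

store = ArtifactStore()
store.put("task/plan", "plan", "1. search  2. summarize")
draft = store.put("task/draft", "draft", "summary v1", parents=["task/plan"])
print([f"{a.kind}@v{a.version}" for a in store.lineage(draft)])
```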
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
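The contrast with loop-based agents comes down to recomputation: a lineage DAG re-executes only nodes whose inputs actually changed, so untouched state is preserved byte-for-byte instead of being regenerated (and potentially contaminated) on every pass. A small sketch of that invalidation rule; the graph, node functions, and dirty-marking are illustrative:

```python
# Each node: (dependencies, pure function of dependency outputs).
GRAPH = {
    "fetch":  ((),          lambda: "raw data"),
    "clean":  (("fetch",),  lambda raw: raw.upper()),
    "report": (("clean",),  lambda c: f"REPORT[{c}]"),
}

cache, runs = {}, []

def run(node, dirty):
    """Recompute a node iff it is marked dirty or an upstream output changed."""
    deps, fn = GRAPH[node]
    inputs = [run(d, dirty) for d in deps]
    stale = node in dirty or node not in cache or cache[node][0] != inputs
    if stale:
        cache[node] = (inputs, fn(*inputs))  # memoize inputs alongside output
        runs.append(node)
    return cache[node][1]

print(run("report", dirty=set()), "| executed:", runs)    # full first pass
runs.clear()
print(run("report", dirty={"clean"}), "| executed:", runs)  # only 'clean' reruns
```

On the second pass only the dirtied node re-executes; because its output is unchanged, downstream state is returned from cache exactly as it was, which is the "perfect state preservation" property in miniature.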
-
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
The paper argues that reliable AI needs structured Knowledge Objects that externalize implicit knowledge, which current methods cannot verify, and enable human validation of it.
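One way to picture a structured Knowledge Object is as a record pairing an explicit claim with its provenance and a human sign-off slot, so validation becomes a checkable step rather than an implicit assumption. The schema below is purely a guess for illustration; the paper's actual proposal is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class KnowledgeObject:
    claim: str               # implicit knowledge, made explicit and inspectable
    provenance: str          # where the claim came from / why it is believed
    validated_by: str = ""   # human sign-off; empty means not yet trustworthy

    def validate(self, reviewer):
        """Human-in-the-loop step: only validated objects should be relied on."""
        self.validated_by = reviewer
        return self

ko = KnowledgeObject(
    claim="retries beyond 3 rarely help for this flaky API",
    provenance="pattern observed across 40 agent runs in March logs",
)
print(ko.validate("reviewer@example.com"))
```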
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.