Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao · 2023

Canonical reference. 75% of citing Pith papers cite this work as background.

44 Pith papers citing it

Background 75% of classified citations

citation-role summary

background 15 · baseline 5

citation-polarity summary

background 15 · baseline 5

representative citing papers

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

State-Centric Decision Process

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.

MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

cs.SE · 2026-05-09 · conditional · novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Inference-Time Budget Control for LLM Search Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

cs.CV · 2026-05-18 · conditional · novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems

cs.CR · 2026-05-16 · unverdicted · novelty 6.0

A hybrid LLM-RL red teaming framework generates adaptive attack campaigns in simulated enterprise networks to evaluate the robustness of AI-enabled SOAR systems.

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

citing papers explorer

MemGym: a Long-Horizon Memory Environment for LLM Agents cs.CL · 2026-05-20 · unverdicted · none · ref 41
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory cs.CL · 2026-05-15 · unverdicted · none · ref 22
SMMBench is a benchmark evaluating multimodal agents on cross-source reasoning, conflict resolution, preference reasoning, and action prediction, showing current systems struggle with evidence distributed across heterogeneous sources.
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation cs.SE · 2026-05-14 · unverdicted · none · ref 28
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 25 · 2 links
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
State-Centric Decision Process cs.AI · 2026-05-12 · unverdicted · none · ref 35
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces cs.AI · 2026-05-12 · unverdicted · none · ref 22
SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning cs.LG · 2026-05-11 · unverdicted · none · ref 32
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs cs.AI · 2026-05-11 · unverdicted · none · ref 15
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium cs.AI · 2026-05-10 · unverdicted · none · ref 56
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries cs.SE · 2026-05-09 · conditional · none · ref 25
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 50 · 2 links
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
RewardHarness: Self-Evolving Agentic Post-Training cs.AI · 2026-05-09 · unverdicted · none · ref 23
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents cs.RO · 2026-05-08 · unverdicted · none · ref 27 · 2 links
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance cs.CL · 2026-05-08 · unverdicted · none · ref 13
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
Inference-Time Budget Control for LLM Search Agents cs.AI · 2026-05-07 · unverdicted · none · ref 45
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 26
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles cs.LG · 2026-05-21 · unverdicted · none · ref 47
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents cs.CL · 2026-05-20 · unverdicted · none · ref 24
Auto-Dreamer trains an offline memory consolidator via GRPO on agent performance to abstract cross-session patterns, outperforming baselines by 7 points on ScienceWorld with 12x smaller memory and generalizing to ALFWorld and WebArena.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 47
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
A Red Teaming Framework for Evaluating Robustness of AI-enabled Security Orchestration, Automation, and Response Systems cs.CR · 2026-05-16 · unverdicted · none · ref 10
A hybrid LLM-RL red teaming framework generates adaptive attack campaigns in simulated enterprise networks to evaluate the robustness of AI-enabled SOAR systems.
DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery cs.LG · 2026-05-14 · unverdicted · none · ref 18
DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 26
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 19
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment cs.AI · 2026-05-12 · unverdicted · none · ref 38
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 23 · 3 links
STAR presents a failure-aware routing framework using a state-conditioned transition policy and an agent routing matrix combining expert routes with learned recoveries from execution traces to improve multi-agent spatiotemporal reasoning.
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory cs.LG · 2026-05-10 · unverdicted · none · ref 37
Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.
From History to State: Constant-Context Skill Learning for LLM Agents cs.AI · 2026-05-06 · unverdicted · none · ref 28
Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 2-7x.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 86
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis cs.SE · 2026-04-12 · unverdicted · none · ref 49
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems cs.MA · 2026-04-03 · unverdicted · none · ref 52
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems cs.MA · 2026-05-19 · unverdicted · none · ref 58
Systematic study of inter-agent communication in LLM multi-agent systems shows reasoning and verification are critical for performance, with a new augmentation technique recovering 86.2% of failures.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 247
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction cs.RO · 2026-05-18 · unverdicted · none · ref 28
Robo-Cortex proposes a self-evolving embodied navigation agent using dual-grain cognitive memory and autonomous knowledge induction from trajectories, reporting SPL gains on IGNav, AR, AEQA and preliminary real-robot tests.
Beyond Scaling: Agents Are Heading to the Edge cs.LG · 2026-05-18 · unverdicted · none · ref 51
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution cs.CL · 2026-05-18 · unverdicted · none · ref 51
SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model changes.
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents cs.AI · 2026-05-11 · unverdicted · none · ref 16
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models cs.AI · 2026-05-11 · unverdicted · none · ref 30
M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations cs.AI · 2026-04-29 · unverdicted · none · ref 41 · 2 links
Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production deployment.
Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve cs.AI · 2026-04-15 · unverdicted · none · ref 21
Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
SpaceMind: A Modular and Self-Evolving Embodied Vision-Language Agent Framework for Autonomous On-orbit Servicing cs.RO · 2026-04-15 · unverdicted · none · ref 29
SpaceMind is a self-evolving modular VLM agent framework that achieves 90-100% navigation success in nominal conditions and recovers from failures via experience distillation, with zero-code transfer to physical robots for on-orbit tasks.
"Theater of Mind" for LLMs: A Cognitive Architecture Based on Global Workspace Theory cs.MA · 2026-04-09 · unverdicted · none · ref 8
Global Workspace Agents (GWA) is proposed as an active, event-driven cognitive architecture for LLMs featuring an entropy-based intrinsic drive and dual-layer memory to enable sustained self-directed agency.
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO cs.CL · 2026-04-30 · unverdicted · none · ref 20
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
CogniFold: Always-On Proactive Memory via Cognitive Folding cs.AI · 2026-05-13 · unreviewed · ref 52
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization cs.AI · 2026-05-09 · unreviewed · ref 33