
arxiv: 2303.11366 · v4 · submitted 2023-03-20 · 💻 cs.AI · cs.CL · cs.LG

Reflexion: Language Agents with Verbal Reinforcement Learning

Pith reviewed 2026-05-10 13:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords language agents · verbal reinforcement learning · episodic memory · reflexion · feedback signals · trial-and-error learning · coding benchmarks · decision making

The pith

Language agents can learn from trial and error by storing their own verbal reflections on feedback instead of retraining model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflexion as a way for language model agents to improve their behavior across repeated attempts at a task. Agents generate text that reflects on external or internal feedback after each try, then keep those reflections in a memory store to shape better choices next time. This replaces the usual need for large numbers of training examples and costly parameter updates. A reader would care because it shows a lightweight route to making goal-driven agents more effective in settings such as code generation or sequential planning.

Core claim

Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. The framework accepts scalar values or free-form language from either external sources or the agent itself, and produces large gains over a baseline agent on sequential decision-making, coding, and language reasoning tasks.

What carries the argument

An episodic memory buffer that stores the agent's self-generated verbal reflections on past feedback to guide actions in later trials.
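
The mechanism above can be sketched as a plain trial loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the callables `act`, `evaluate`, and `reflect`, the toy number-guessing task, and the trial budget are all hypothetical names and values introduced here. In a real agent each callable would wrap a language model or an environment.

```python
from typing import Callable, List, Tuple


def run_reflexion(
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 5,
) -> Tuple[bool, int, List[str]]:
    """Retry a task, storing one verbal reflection per failed trial."""
    memory: List[str] = []  # episodic buffer of self-generated reflections
    for trial in range(1, max_trials + 1):
        attempt = act(task, memory)            # act, conditioned on memory
        success, feedback = evaluate(attempt)  # external or internal signal
        if success:
            return True, trial, memory
        memory.append(reflect(attempt, feedback))  # no weight update
    return False, max_trials, memory


# Toy stand-ins (assumptions): a real agent would call an LLM here.
def toy_act(task: str, memory: List[str]) -> str:
    # Start at 0, otherwise follow the suggestion in the latest reflection.
    return memory[-1].split()[-1] if memory else "0"


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    target, value = 3, int(attempt)
    if value == target:
        return True, "correct"
    return False, ("too low" if value < target else "too high")


def toy_reflect(attempt: str, feedback: str) -> str:
    value = int(attempt)
    step = 1 if feedback == "too low" else -1
    return f"My guess {value} was {feedback}; next time try {value + step}"


solved, trials, memory = run_reflexion(
    toy_act, toy_evaluate, toy_reflect, "guess the number"
)
```

The only state that changes between trials is the list of reflections, which is the sense in which the buffer carries the argument.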

If this is right

  • The method yields 91 percent pass@1 accuracy on the HumanEval coding benchmark, above the prior 80 percent mark for GPT-4.
  • Performance rises across sequential decision-making, coding, and language reasoning when the reflection buffer is added.
  • The same framework handles both numeric and free-form feedback signals from external or internal sources.
  • Ablation tests reveal how choice of feedback type and incorporation method changes final accuracy.
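
For readers unfamiliar with the headline metric, pass@k is commonly estimated per problem from n sampled solutions of which c pass the unit tests, as 1 - C(n-c, k) / C(n, k), then averaged over problems. The sketch below shows that estimator. This page does not state how many samples the paper drew per problem, so the sample counts used in any call are assumptions, and the helper names are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: Iterable[Tuple[int, int]]) -> float:
    """Average pass@1 over problems given (n, c) pairs."""
    pairs = list(results)
    return sum(pass_at_k(n, c, 1) for n, c in pairs) / len(pairs)
```

For k = 1 the estimator reduces to c / n, so with a single sample per problem pass@1 is simply the fraction of problems solved.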

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The results imply that language itself can act as a substitute for gradient updates when an agent must adapt to new outcomes.
  • Agents equipped with such a buffer may continue improving over many interactions without any external retraining step.
  • The approach could be tested on longer-horizon tasks where memory of past linguistic feedback becomes even more critical.

Load-bearing premise

That reflections written by the same language model will be accurate enough and relevant enough to produce reliably better choices on the next attempt.

What would settle it

Running the same agent with and without the reflection step on a held-out task and finding no consistent gain or even a drop in success rate.
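
One way to read such a with-and-without comparison is per task rather than only in aggregate, since an unchanged success rate can hide tasks that were fixed and tasks that were broken. The sketch below is an assumption about how one might tabulate it; the function name and the outcome dictionaries are hypothetical, and no numbers here come from the paper.

```python
from typing import Dict


def compare_conditions(
    baseline: Dict[str, bool], reflexion: Dict[str, bool]
) -> Dict[str, float]:
    """Paired comparison of per-task success with and without reflection."""
    tasks = sorted(baseline.keys() & reflexion.keys())
    if not tasks:
        raise ValueError("no tasks shared by both conditions")
    n = len(tasks)
    # Tasks that only the reflection condition solves, and the reverse.
    fixed = sum(1 for t in tasks if reflexion[t] and not baseline[t])
    broken = sum(1 for t in tasks if baseline[t] and not reflexion[t])
    return {
        "n_tasks": n,
        "baseline_rate": sum(baseline[t] for t in tasks) / n,
        "reflexion_rate": sum(reflexion[t] for t in tasks) / n,
        "fixed": fixed,
        "broken": broken,
        "net_gain": (fixed - broken) / n,
    }


# Made-up outcomes for illustration only.
report = compare_conditions(
    {"a": True, "b": False, "c": False, "d": True},
    {"a": True, "b": True, "c": False, "d": False},
)
```

In this made-up case one task is fixed and one is broken, so the net gain is zero, which is the kind of outcome that would count against the premise.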

Original abstract

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Reflexion, a framework in which LLM-based agents generate verbal reflections on task feedback (scalar or linguistic, external or simulated), store the reflections in an episodic memory buffer, and condition future generations on this text to improve performance without any weight updates. It reports substantial gains over baselines across sequential decision-making, coding, and reasoning tasks, with a headline result of 91% pass@1 on HumanEval (vs. prior SOTA of 80% for GPT-4), and it includes ablations on feedback types, incorporation methods, and agent variants.

Significance. If the performance claims hold after addressing the controls below, the work would be significant for demonstrating that linguistic self-reflection can serve as an efficient, training-free mechanism for agent improvement. The 11-point HumanEval lift is notable for a coding benchmark, and the framework's flexibility with diverse feedback sources could reduce reliance on expensive fine-tuning. The reported ablation and analysis studies already provide some mechanistic insight into component contributions.

major comments (3)
  1. [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
  2. [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.
  3. [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
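
The controls requested in major comments 1 and 2 amount to holding the actor, evaluator, and trial budget fixed while varying only what is written to memory after a failure. A minimal sketch of that experimental design follows. All names (`run_condition`, `no_memory`, `raw_feedback`, `with_reflection`) and the stub actor, evaluator, and reflector are hypothetical, and the code says nothing about which condition would actually perform best.

```python
from typing import Callable, List, Optional, Tuple

MemoryWriter = Callable[[str, str], Optional[str]]


def no_memory(attempt: str, feedback: str) -> Optional[str]:
    return None  # plain retry: nothing is carried across trials


def raw_feedback(attempt: str, feedback: str) -> Optional[str]:
    return feedback  # control: store the unprocessed error text


def with_reflection(reflect: Callable[[str, str], str]) -> MemoryWriter:
    def writer(attempt: str, feedback: str) -> Optional[str]:
        return reflect(attempt, feedback)  # store a generated reflection
    return writer


def run_condition(
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    write_memory: MemoryWriter,
    task: str,
    max_trials: int,
) -> Tuple[Optional[int], List[str]]:
    """Return the trial that succeeded (or None) and the final memory."""
    memory: List[str] = []
    for trial in range(1, max_trials + 1):
        attempt = act(task, memory)
        success, feedback = evaluate(attempt)
        if success:
            return trial, memory
        entry = write_memory(attempt, feedback)
        if entry is not None:
            memory.append(entry)
    return None, memory


# Stubs (assumptions) that always fail, to show what each condition stores.
def stub_act(task: str, memory: List[str]) -> str:
    return "draft"


def stub_evaluate(attempt: str) -> Tuple[bool, str]:
    return False, "AssertionError: expected 4, got 5"


def stub_reflect(attempt: str, feedback: str) -> str:
    return f"The attempt failed with '{feedback}'; recheck the arithmetic."


conditions = {
    "no_memory": no_memory,
    "raw_feedback": raw_feedback,
    "reflection": with_reflection(stub_reflect),
}
memories = {
    name: run_condition(stub_act, stub_evaluate, writer, "toy task", 3)[1]
    for name, writer in conditions.items()
}
```

Because the three conditions share one loop, any difference in success rate can be attributed to the memory writer rather than to the number of retries.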
minor comments (2)
  1. [Abstract] The abstract states that Reflexion 'obtains significant improvements over a baseline agent' but does not quantify the gains for the non-HumanEval tasks; adding one or two concrete numbers would strengthen the summary.
  2. [Results tables] Tables reporting pass@1 or success rates should include the number of independent runs and standard deviations, given the stochasticity of LLM sampling.
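
The reporting asked for in minor comment 2 is cheap to produce once per-run outcomes are kept. A small sketch using the standard library follows; the function name and the run data are made up for illustration and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_runs(runs: List[List[bool]]) -> Dict[str, float]:
    """Mean and sample standard deviation of success rate across runs."""
    if len(runs) < 2:
        raise ValueError("need at least two independent runs for a std")
    rates = [sum(outcomes) / len(outcomes) for outcomes in runs]
    return {"runs": len(rates), "mean": mean(rates), "std": stdev(rates)}


# Four hypothetical runs over the same four tasks.
summary = summarize_runs([
    [True, True, False, False],
    [True, True, True, False],
    [True, True, True, False],
    [True, True, True, True],
])
```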

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the contributions and improve the reproducibility of our work. We address each major point below and commit to revisions where appropriate.

point-by-point responses
  1. Referee: [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.

    Authors: We agree this control is valuable for isolating the reflection mechanism. Our baseline agent receives execution feedback but does not generate verbal reflections; however, we did not explicitly test direct appending of raw feedback traces without any reflection step. In the revised manuscript we will add this exact control using GPT-4, the same trial budget, and identical prompt templates except for the absence of reflection generation. This will directly address whether verbalization is load-bearing. revision: yes

  2. Referee: [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.

    Authors: We acknowledge the gap. Our existing ablations examine feedback types and incorporation strategies, yet they do not include a pure raw-feedback storage baseline. We will add this comparison in the revised Section 4, reporting performance when the episodic buffer stores and re-injects raw execution traces without LLM-generated reflections. This will provide direct evidence on the necessity of the verbal reflection step. revision: yes

  3. Referee: [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.

    Authors: We will expand Section 3 with additional detail and pseudocode. The revised text will specify: (1) the buffer stores up to k most recent reflections (k=3 in our experiments), (2) retrieval concatenates all retained reflections in reverse chronological order without summarization, (3) truncation occurs only if total tokens exceed the model context limit, in which case the oldest entries are dropped first, and (4) the prompt template explicitly places the memory buffer before the current task description. These clarifications will improve reproducibility. revision: yes
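
The four buffer rules listed in the simulated response to point 3 can be sketched directly. Note that these rules come from a simulated rebuttal, so they should not be read as the paper's confirmed design. The class and function names, the prompt wording, and the whitespace token counter are assumptions introduced here; a real system would use the model's own tokenizer.

```python
from collections import deque
from typing import Callable, Deque, List


class ReflectionBuffer:
    """Keep only the k most recent reflections (rule 1)."""

    def __init__(self, k: int = 3) -> None:
        self._items: Deque[str] = deque(maxlen=k)

    def add(self, reflection: str) -> None:
        self._items.append(reflection)  # deque evicts the oldest when full

    def newest_first(self) -> List[str]:
        return list(reversed(self._items))  # reverse chronological (rule 2)


def count_tokens(text: str) -> int:
    # Crude whitespace proxy for a tokenizer (assumption).
    return len(text.split())


def build_prompt(
    buffer: ReflectionBuffer,
    task: str,
    token_limit: int,
    count: Callable[[str], int] = count_tokens,
) -> str:
    """Place memory before the task (rule 4), dropping oldest on overflow (rule 3)."""
    reflections = buffer.newest_first()
    while True:
        if reflections:
            memory_block = "\n".join(f"- {r}" for r in reflections)
            prompt = (
                f"Reflections from earlier trials:\n{memory_block}\n\n"
                f"Task:\n{task}"
            )
        else:
            prompt = f"Task:\n{task}"
        if count(prompt) <= token_limit or not reflections:
            return prompt
        reflections.pop()  # last item in newest-first order is the oldest


buf = ReflectionBuffer(k=3)
for i in range(1, 5):
    buf.add(f"reflection {i}")
full_prompt = build_prompt(buf, "solve it", token_limit=1000)
tight_prompt = build_prompt(buf, "solve it", token_limit=13)
```

Here `full_prompt` keeps reflections 4, 3, and 2 in that order ahead of the task, while `tight_prompt` drops reflection 2 to fit the smaller budget.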

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains from verbal reflection framework

full rationale

The paper proposes Reflexion as an empirical framework where LLMs generate verbal reflections on task feedback, store them in episodic memory, and condition future generations on that text to improve performance without weight updates. The central results are direct pass@1 accuracy comparisons on benchmarks such as HumanEval (91% vs. prior 80% GPT-4), with ablations on feedback types and incorporation methods. No equations, fitted parameters, or self-referential definitions appear in the derivation; the reported improvements are measured outcomes from iterative prompting experiments rather than quantities forced by construction from the inputs. The method's effectiveness is presented as an empirical finding open to external validation, with no load-bearing self-citations or ansatzes that collapse the claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that current LLMs can generate and use self-reflections effectively; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption: Large language models can generate meaningful verbal reflections from task feedback that improve subsequent decisions when stored and retrieved.
    This is the central premise enabling the memory buffer to function as reinforcement without weight updates.
invented entities (1)
  • Episodic memory buffer storing reflective text (no independent evidence)
    purpose: Maintains history of self-generated reflections to condition future agent actions.
    New architectural component introduced by the framework.

pith-pipeline@v0.9.0 · 5524 in / 1263 out tokens · 50911 ms · 2026-05-10T13:47:03.070574+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

  3. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  4. Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

    quant-ph 2025-10 accept novelty 8.0 full

    A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...

  5. ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

    cs.CR 2025-07 unverdicted novelty 8.0

    ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

  6. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  7. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    cs.CL 2023-09 unverdicted novelty 8.0

    Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

  8. WebArena: A Realistic Web Environment for Building Autonomous Agents

    cs.AI 2023-07 accept novelty 8.0

    WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.

  9. Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.

  10. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    cs.CL 2026-05 unverdicted novelty 7.0

    Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...

  11. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    cs.CL 2026-05 unverdicted novelty 7.0

    Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

  12. HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

    cs.CR 2026-05 unverdicted novelty 7.0

    HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false posi...

  13. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

  14. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

  15. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...

  16. ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

    cs.CV 2026-05 unverdicted novelty 7.0

    ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.

  17. Test-Time Hinting for Black-Box Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

  18. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  19. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  20. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  21. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  22. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  23. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  24. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  25. BIM Information Extraction Through LLM-based Adaptive Exploration

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

  26. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  27. InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

    cs.LG 2026-05 unverdicted novelty 7.0

    InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.

  28. Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

    cs.SE 2026-04 unverdicted novelty 7.0

    Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

  29. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  30. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  31. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  32. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  33. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  34. Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

    cs.AI 2026-04 unverdicted novelty 7.0

    A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...

  35. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  36. Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...

  37. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  38. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  39. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 conditional novelty 7.0

    A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.

  40. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  41. MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

    cond-mat.mtrl-sci 2026-04 conditional novelty 7.0

    MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interv...

  42. MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

    cond-mat.mtrl-sci 2026-04 conditional novelty 7.0

    MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...

  43. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  44. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  45. LETGAMES: An LLM-Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment

    cs.HC 2026-02 unverdicted novelty 7.0

    LETGAMES uses LLMs to generate open-world D&D-inspired games with conversational guidance for personalized cognitive training, validated through a new psychology-grounded evaluation protocol showing promise in LLM and...

  46. MemEvolve: Meta-Evolution of Agent Memory Systems

    cs.CL 2025-12 unverdicted novelty 7.0

    MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.

  47. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  48. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  49. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  50. What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

    cs.CL 2026-05 unverdicted novelty 6.0

    Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...

  51. Reinforcing Human Behavior Simulation via Verbal Feedback

    cs.LG 2026-05 unverdicted novelty 6.0

    DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

  52. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...

  53. optimize_anything: A Universal API for Optimizing any Text Parameter

    cs.CL 2026-05 unverdicted novelty 6.0

    A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

  54. Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

    cs.LG 2026-05 unverdicted novelty 6.0

    TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...

  55. An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

    cs.CR 2026-05 unverdicted novelty 6.0

    Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.

  56. ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

    cs.SE 2026-05 unverdicted novelty 6.0

    ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.

  57. The Scaling Laws of Skills in LLM Agent Systems

    cs.CL 2026-05 unverdicted novelty 6.0

    Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...

  58. Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

    cs.CL 2026-05 conditional novelty 6.0

    NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.

  59. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA ...

  60. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 142 Pith papers · 13 internal anchors

  1. [1]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691

  2. [2]

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732

  3. [3]

    Brooks, E., Walls, L., Lewis, R. L., and Singh, S. (2022). In-context policy iteration. arXiv preprint arXiv:2210.03821

  4. [4]

    Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2022). MultiPL-E: A scalable and extensible approach to benchmarking neural code generation

  5. [5]

    Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. (2022). Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397

  6. [6]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  7. [7]

    Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128

  8. [8]

    Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. (2019). Textworld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July...

  9. [9]

    Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgoodman.substack.com

  10. [10]

    Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491

  11. [11]

    Lam, W., Winter, S., Wei, A., Xie, T., Marinov, D., and Bell, J. (2020). A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA)

  12. [12]

    Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. (2022). Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328

  13. [13]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161

  14. [14]

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097

  15. [15]

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2023). Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651

  16. [16]

    Nair, V., Schumacher, E., Tso, G., and Kannan, A. (2023). Dera: Enhancing large language model completions with dialog-enabled resolving agents. arXiv preprint arXiv:2303.17071

  17. [17]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  18. [18]

    Gpt-4 technical report

    OpenAI (2023). Gpt-4 technical report. ArXiv

  19. [19]

    Generative Agents: Interactive Simulacra of Human Behavior

    Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442

  20. [20]

    Paul, D., Ismayilzada, M., Peyrard, M., Borges, B., Bosselut, A., West, R., and Faltings, B. (2023). Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904

  21. [21]

    Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495

  22. [22]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

  23. [23]

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580

  24. [24]

    Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR)

  25. [25]

    Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition

  26. [26]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  27. [27]

    Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. (2023). Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633

  28. [28]

    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

  29. [29]

    Yao, S., Chen, H., Yang, J., and Narasimhan, K. (preprint). Webshop: Towards scalable real-world web interaction with grounded language agents. In ArXiv

  30. [30]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

  31. [31]

    Yoran, O., Wolfson, T., Bogin, B., Katz, U., Deutch, D., and Berant, J. (2023). Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007