Pith · machine review for the scientific record

arXiv: 2303.11366 · v4 · submitted 2023-03-20 · 💻 cs.AI · cs.CL · cs.LG

Recognition: 2 Lean theorem links

Reflexion: Language Agents with Verbal Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords language agents · verbal reinforcement learning · episodic memory · reflexion · feedback signals · trial-and-error learning · coding benchmarks · decision making

The pith

Language agents can learn from trial and error by storing their own verbal reflections on feedback instead of retraining model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflexion as a way for language model agents to improve their behavior across repeated attempts at a task. Agents generate text that reflects on external or internal feedback after each try, then keep those reflections in a memory store to shape better choices next time. This replaces the usual need for large numbers of training examples and costly parameter updates. A reader would care because it shows a lightweight route to making goal-driven agents more effective in settings such as code generation or sequential planning.

Core claim

Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. The framework accepts scalar values or free-form language from either external sources or the agent itself, and produces large gains over a baseline agent on sequential decision-making, coding, and language reasoning tasks.

What carries the argument

An episodic memory buffer that stores the agent's self-generated verbal reflections on past feedback to guide actions in later trials.
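The reflect-store-retry loop this buffer enables can be sketched in a few lines. The actor/evaluator/self-reflection split is taken from the paper; the three callables are hypothetical stand-ins for the underlying LLM calls, so this is a minimal sketch of the control flow, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ReflexionAgent:
    # Hypothetical stand-ins for LLM calls:
    actor: callable       # (task, memory) -> attempt
    evaluator: callable   # (task, attempt) -> (passed, feedback)
    reflector: callable   # (task, attempt, feedback) -> reflection text
    memory: list = field(default_factory=list)

    def run(self, task, max_trials=4):
        attempt = None
        for trial in range(max_trials):
            attempt = self.actor(task, self.memory)
            passed, feedback = self.evaluator(task, attempt)
            if passed:
                return attempt, trial + 1
            # Verbal "reinforcement": store a reflection on the feedback
            # instead of updating any model weights.
            self.memory.append(self.reflector(task, attempt, feedback))
        return attempt, max_trials
```

The key design choice is that the only state carried between trials is text in `memory`; the model itself is frozen throughout.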

If this is right

  • The method yields 91% pass@1 accuracy on the HumanEval coding benchmark, above the prior state-of-the-art of 80% reported for GPT-4.
  • Performance rises across sequential decision-making, coding, and language reasoning when the reflection buffer is added.
  • The same framework handles both numeric and free-form feedback signals from external or internal sources.
  • Ablation tests reveal how choice of feedback type and incorporation method changes final accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that language itself can act as a substitute for gradient updates when an agent must adapt to new outcomes.
  • Agents equipped with such a buffer may continue improving over many interactions without any external retraining step.
  • The approach could be tested on longer-horizon tasks where memory of past linguistic feedback becomes even more critical.

Load-bearing premise

That reflections written by the same language model will be accurate enough and relevant enough to produce reliably better choices on the next attempt.

What would settle it

Running the same agent with and without the reflection step on a held-out task and finding no consistent gain or even a drop in success rate.
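The settling experiment described above amounts to a two-arm ablation harness. The sketch below assumes toy `attempt` and `reflect` callables in place of real LLM calls; the gap between the two success rates is the statistic that would settle the question.

```python
def success_rate(attempt, reflect, tasks, use_reflection, max_trials=4):
    """Run held-out tasks with and without the reflection step.
    attempt(task, memory) -> (passed, feedback); reflect(feedback) -> text.
    Both are hypothetical stand-ins for the agent's LLM calls."""
    solved = 0
    for task in tasks:
        memory = []
        for _ in range(max_trials):
            passed, feedback = attempt(task, memory)
            if passed:
                solved += 1
                break
            if use_reflection:
                # The only difference between arms: store a reflection.
                memory.append(reflect(feedback))
    return solved / len(tasks)
```

No consistent positive gap between `use_reflection=True` and `use_reflection=False` under matched trial budgets would undercut the load-bearing premise.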

Original abstract

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Reflexion, a framework in which LLM-based agents generate verbal reflections on task feedback (scalar or linguistic, external or simulated), store the reflections in an episodic memory buffer, and condition future generations on this text to improve performance without any weight updates. It reports substantial gains over baselines across sequential decision-making, coding, and reasoning tasks, with the headline result of 91% pass@1 on HumanEval (vs. a prior SOTA of 80% for GPT-4), and it includes ablations on feedback types, incorporation methods, and agent variants.

Significance. If the performance claims hold after addressing the controls below, the work would be significant for demonstrating that linguistic self-reflection can serve as an efficient, training-free mechanism for agent improvement. The 11-point HumanEval lift is notable for a coding benchmark, and the framework's flexibility with diverse feedback sources could reduce reliance on expensive fine-tuning. The reported ablation and analysis studies already provide some mechanistic insight into component contributions.

major comments (3)
  1. [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
  2. [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.
  3. [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
minor comments (2)
  1. [Abstract] The abstract states that Reflexion 'obtains significant improvements over a baseline agent' but does not quantify the gains for the non-HumanEval tasks; adding one or two concrete numbers would strengthen the summary.
  2. [Results tables] Tables reporting pass@1 or success rates should include the number of independent runs and standard deviations, given the stochasticity of LLM sampling.
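The control condition requested in major comments 1 and 2 reduces to a prompt-construction toggle: hold the base model, trial budget, and template fixed, and vary only whether the buffer carries LLM-written reflections or raw execution feedback. A hypothetical sketch (field names `reflection` and `feedback` are illustrative, not from the paper):

```python
def build_prompt(task, history, mode):
    """Build the trial prompt under the two conditions the referee
    asks to compare. 'reflexion' injects LLM-written reflections;
    'raw' injects unprocessed execution feedback. All else is fixed."""
    if mode == "reflexion":
        memory = [h["reflection"] for h in history]
    elif mode == "raw":
        memory = [h["feedback"] for h in history]
    else:
        raise ValueError(mode)
    lines = []
    if memory:
        lines.append("Past trials:")
        lines.extend(f"- {m}" for m in memory)
    lines.extend(["Task:", task])
    return "\n".join(lines)
```

Comparing pass@1 across the two modes with matched trial counts would isolate whether verbalization, rather than feedback accumulation, drives the gain.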

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the contributions and improve the reproducibility of our work. We address each major point below and commit to revisions where appropriate.

Point-by-point responses
  1. Referee: [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.

    Authors: We agree this control is valuable for isolating the reflection mechanism. Our baseline agent receives execution feedback but does not generate verbal reflections; however, we did not explicitly test direct appending of raw feedback traces without any reflection step. In the revised manuscript we will add this exact control using GPT-4, the same trial budget, and identical prompt templates except for the absence of reflection generation. This will directly address whether verbalization is load-bearing. revision: yes

  2. Referee: [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.

    Authors: We acknowledge the gap. Our existing ablations examine feedback types and incorporation strategies, yet they do not include a pure raw-feedback storage baseline. We will add this comparison in the revised Section 4, reporting performance when the episodic buffer stores and re-injects raw execution traces without LLM-generated reflections. This will provide direct evidence on the necessity of the verbal reflection step. revision: yes

  3. Referee: [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.

    Authors: We will expand Section 3 with additional detail and pseudocode. The revised text will specify: (1) the buffer stores up to k most recent reflections (k=3 in our experiments), (2) retrieval concatenates all retained reflections in reverse chronological order without summarization, (3) truncation occurs only if total tokens exceed the model context limit by dropping oldest entries first, and (4) the prompt template explicitly places the memory buffer before the current task description. These clarifications will improve reproducibility. revision: yes
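The buffer mechanics promised in response 3 can be sketched directly: retain the k most recent reflections (k=3), inject them newest-first, drop oldest entries when over a token budget, and place the memory before the task. These rules come from the rebuttal text; the whitespace-word token count is a stand-in assumption for a real tokenizer.

```python
class EpisodicBuffer:
    """Sketch of the retrieval rules described in the rebuttal."""

    def __init__(self, k=3, max_tokens=256):
        self.k = k
        self.max_tokens = max_tokens
        self.entries = []

    def add(self, reflection):
        self.entries.append(reflection)
        self.entries = self.entries[-self.k:]  # retain k most recent

    def render(self, task):
        kept = list(self.entries)
        count = lambda items: sum(len(r.split()) for r in items)  # token proxy
        while kept and count(kept) > self.max_tokens:
            kept.pop(0)  # truncate by dropping oldest entries first
        memory = list(reversed(kept))  # reverse chronological order
        return "\n".join(memory + [task])  # memory precedes the task
```

Making these four decisions explicit is what the referee's reproducibility concern asks for; each is an independent knob a replication could vary.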

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains from verbal reflection framework

Full rationale

The paper proposes Reflexion as an empirical framework where LLMs generate verbal reflections on task feedback, store them in episodic memory, and condition future generations on that text to improve performance without weight updates. The central results are direct pass@1 accuracy comparisons on benchmarks such as HumanEval (91% vs. prior 80% GPT-4), with ablations on feedback types and incorporation methods. No equations, fitted parameters, or self-referential definitions appear in the derivation; the reported improvements are measured outcomes from iterative prompting experiments rather than quantities forced by construction from the inputs. The method's effectiveness is presented as an empirical finding open to external validation, with no load-bearing self-citations or ansatzes that collapse the claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that current LLMs can generate and use self-reflections effectively; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption Large language models can generate meaningful verbal reflections from task feedback that improve subsequent decisions when stored and retrieved.
    This is the central premise enabling the memory buffer to function as reinforcement without weight updates.
invented entities (1)
  • Episodic memory buffer storing reflective text (no independent evidence)
    purpose: Maintains history of self-generated reflections to condition future agent actions.
    New architectural component introduced by the framework.

pith-pipeline@v0.9.0 · 5524 in / 1263 out tokens · 50911 ms · 2026-05-10T13:47:03.070574+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.

  3. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  4. WebArena: A Realistic Web Environment for Building Autonomous Agents

    cs.AI 2023-07 accept novelty 8.0

    WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.

  5. ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

    cs.CV 2026-05 unverdicted novelty 7.0

    ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.

  6. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  7. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  8. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  9. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  10. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  11. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  12. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...

  13. BIM Information Extraction Through LLM-based Adaptive Exploration

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

  14. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  15. InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

    cs.LG 2026-05 unverdicted novelty 7.0

    InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.

  16. Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

    cs.SE 2026-04 unverdicted novelty 7.0

    Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

  17. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  18. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  19. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  20. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  21. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  22. Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication

    cs.AI 2026-04 unverdicted novelty 7.0

    A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...

  23. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  24. Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...

  25. AI scientists produce results without reasoning scientifically

    cs.AI 2026-04 conditional novelty 7.0

    LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

  26. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  27. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 conditional novelty 7.0

    A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.

  28. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  29. MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

    cond-mat.mtrl-sci 2026-04 conditional novelty 7.0

    MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...

  30. BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

    cs.NE 2026-03 unverdicted novelty 7.0

    BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

  31. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  32. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  33. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  34. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  35. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  36. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  37. gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

    gr-qc 2026-05 unverdicted novelty 6.0

    LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.

  38. PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

    cs.AI 2026-05 unverdicted novelty 6.0

    PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

  39. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  40. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  41. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  42. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  43. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  44. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  45. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  46. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...

  47. Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

  48. The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...

  49. Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

    cs.SE 2026-05 unverdicted novelty 6.0

    RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...

  50. InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

    cs.LG 2026-05 unverdicted novelty 6.0

    InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.

  51. Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    A dedicated reviewer agent supplies inference-time feedback on provisional tool calls, yielding gains on BFCL and Tau2-Bench while quantifying helpfulness versus harmfulness tradeoffs.

  52. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  53. Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

    cs.AI 2026-04 unverdicted novelty 6.0

    Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.

  54. You Don't Need Public Tests to Generate Correct Code

    cs.SE 2026-04 unverdicted novelty 6.0

    DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...

  55. Job Skill Extraction via LLM-Centric Multi-Module Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    SRICL combines semantic retrieval from ESCO, in-context learning, fine-tuning, and output verification to achieve higher STRICT-F1 scores and fewer invalid or hallucinated skill spans than GPT-3.5 baselines on six pub...

  56. HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

    cs.LG 2026-04 unverdicted novelty 6.0

    HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.

  57. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  58. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  59. Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

  60. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
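
Several of the entries above (items 53, 59, and 60 in particular) echo the loop the reviewed paper itself proposes: convert feedback from a failed trial into stored verbal guidance that conditions the next attempt. A minimal sketch of that loop, with all names and the toy task hypothetical rather than taken from any listed paper:

```python
# Minimal sketch (hypothetical names) of a Reflexion-style loop: attempt a
# task, turn the failure signal into a verbal reflection, store it in an
# episodic buffer, and condition the next attempt on the stored reflections.

def attempt(task, reflections):
    """Stand-in for an LLM actor; here it only 'succeeds' once some stored
    reflection mentions the task's required hint."""
    return task["answer"] if any(task["hint"] in r for r in reflections) else None

def reflect(task, result):
    """Stand-in for LLM self-reflection on a failed trial."""
    return f"Attempt failed; next time remember: {task['hint']}"

def run_with_reflection(task, max_trials=3):
    memory = []  # episodic buffer of verbal reflections
    for trial in range(max_trials):
        result = attempt(task, memory)
        if result is not None:
            return result, trial + 1
        memory.append(reflect(task, result))
    return None, max_trials

task = {"hint": "use the secondary index", "answer": 42}
result, trials = run_with_reflection(task)
# result == 42, trials == 2: the first trial fails, its reflection lands in
# memory, and the second trial succeeds.
```

The point of the sketch is that no weights change between trials; all adaptation lives in the text accumulated in `memory`, which is what distinguishes this family of methods from gradient-based reinforcement learning.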

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 91 Pith papers · 11 internal anchors

  1. [1]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691

  2. [2]

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732

  3. [3]

    Brooks, E., Walls, L., Lewis, R. L., and Singh, S. (2022). In-context policy iteration. arXiv preprint arXiv:2210.03821

  4. [4]

    Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2022). Multipl-e: A scalable and extensible approach to benchmarking neural code generation

  5. [5]

    Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. (2022). Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397

  6. [6]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  7. [7]

    Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128

  8. [8]

    Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. (2019). Textworld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July...

  9. [9]

    Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgoodman.substack.com

  10. [10]

    Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491

  11. [11]

    Lam, W., Winter, S., Wei, A., Xie, T., Marinov, D., and Bell, J. (2020). A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA)

  12. [12]

    Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. (2022). Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328

  13. [13]

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161

  14. [14]

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097

  15. [15]

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2023). Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651

  16. [16]

    Nair, V., Schumacher, E., Tso, G., and Kannan, A. (2023). Dera: Enhancing large language model completions with dialog-enabled resolving agents. arXiv preprint arXiv:2303.17071

  17. [17]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  18. [18]

    OpenAI (2023). GPT-4 technical report. ArXiv

  19. [19]

    Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442

  20. [20]

    Paul, D., Ismayilzada, M., Peyrard, M., Borges, B., Bosselut, A., West, R., and Faltings, B. (2023). Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904

  21. [21]

    Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495

  22. [22]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

  23. [23]

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580

  24. [24]

    Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR)

  25. [25]

    Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition

  26. [26]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  27. [27]

    Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q. (2023). Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633

  28. [28]

    Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)

  29. [29]

    Yao, S., Chen, H., Yang, J., and Narasimhan, K. (preprint). Webshop: Towards scalable real-world web interaction with grounded language agents. arXiv preprint

  30. [30]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)

  31. [31]

    Yoran, O., Wolfson, T., Bogin, B., Katz, U., Deutch, D., and Berant, J. (2023). Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007