Reflexion: Language Agents with Verbal Reinforcement Learning
Pith reviewed 2026-05-10 13:47 UTC · model grok-4.3
The pith
Language agents can learn from trial and error by storing their own verbal reflections on feedback instead of retraining model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. The framework accepts scalar values or free-form language from either external sources or the agent itself, and produces large gains over a baseline agent on sequential decision-making, coding, and language reasoning tasks.
What carries the argument
An episodic memory buffer that stores the agent's self-generated verbal reflections on past feedback to guide actions in later trials.
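The trial loop this describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `act`, `evaluate`, and `reflect`, the `max_trials` budget, and the toy stand-ins for the LLM calls are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry a task, carrying self-written reflections forward as plain text."""
    memory: List[str] = []  # episodic buffer; no model weights are touched
    for _ in range(max_trials):
        attempt = act(task, memory)            # condition on past reflections
        success, feedback = evaluate(attempt)  # scalar or free-form signal
        if success:
            return True, memory
        memory.append(reflect(attempt, feedback))
    return False, memory


# Toy stand-ins for the LLM calls (assumptions, for illustration only).
def toy_act(task: str, memory: List[str]) -> str:
    return "fixed" if memory else "buggy"


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    ok = attempt == "fixed"
    return ok, "" if ok else "test_add failed"


def toy_reflect(attempt: str, feedback: str) -> str:
    return f"Attempt '{attempt}' failed with: {feedback}. Try a different approach."


solved, notes = reflexion_loop(toy_act, toy_evaluate, toy_reflect, "implement add")
print(solved, len(notes))
```

The only state that persists between trials is the list of strings in `memory`, which is the sense in which the buffer stands in for a parameter update.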
If this is right
- The method yields 91 percent pass@1 accuracy on the HumanEval coding benchmark, above the prior 80 percent mark for GPT-4.
- Performance rises across sequential decision-making, coding, and language reasoning when the reflection buffer is added.
- The same framework handles both numeric and free-form feedback signals from external or internal sources.
- Ablation tests reveal how choice of feedback type and incorporation method changes final accuracy.
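For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c samples that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that per-problem estimator. The sample counts in the usage lines are made-up illustrations, not figures from the paper, and this page does not state how Reflexion's own pass@1 was tallied.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k estimate from n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts, for illustration only.
estimate_k1 = pass_at_k(n=10, c=3, k=1)
estimate_k5 = pass_at_k(n=10, c=3, k=5)
print(estimate_k1, estimate_k5)
```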
Where Pith is reading between the lines
- The results imply that language itself can act as a substitute for gradient updates when an agent must adapt to new outcomes.
- Agents equipped with such a buffer may continue improving over many interactions without any external retraining step.
- The approach could be tested on longer-horizon tasks where memory of past linguistic feedback becomes even more critical.
Load-bearing premise
That reflections written by the same language model will be accurate enough and relevant enough to produce reliably better choices on the next attempt.
What would settle it
Running the same agent with and without the reflection step on a held-out task and finding no consistent gain or even a drop in success rate.
Original abstract
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reflexion, a framework in which LLM-based agents generate verbal reflections on task feedback (scalar or linguistic, external or simulated), store the reflections in an episodic memory buffer, and condition future generations on this text to improve performance without any weight updates. It reports substantial gains over baselines across sequential decision-making, coding, and reasoning tasks, headlined by 91% pass@1 on HumanEval (vs. the prior SOTA of 80% for GPT-4), and includes ablations on feedback types, incorporation methods, and agent variants.
Significance. If the performance claims hold after addressing the controls below, the work would be significant for demonstrating that linguistic self-reflection can serve as an efficient, training-free mechanism for agent improvement. The 11-point HumanEval lift is notable for a coding benchmark, and the framework's flexibility with diverse feedback sources could reduce reliance on expensive fine-tuning. The reported ablation and analysis studies already provide some mechanistic insight into component contributions.
major comments (3)
- [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
- [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.
- [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
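The controls requested in the first two comments amount to running the same trial loop with three different memory policies. A minimal sketch of such a harness follows, assuming hypothetical condition names (`none`, `raw`, `reflection`) and toy stand-ins for the model calls. The toy actor succeeds whenever memory is non-empty, so the printed numbers only show how the harness is wired and say nothing about how real models would compare.

```python
from typing import Callable, Dict, List, Optional, Tuple

MODES = ("none", "raw", "reflection")  # hypothetical condition names


def memory_entry(mode: str, attempt: str, feedback: str,
                 reflect: Callable[[str, str], str]) -> List[str]:
    """Return what, if anything, is stored after a failed trial."""
    if mode == "none":
        return []                            # plain retry, nothing carried over
    if mode == "raw":
        return [feedback]                    # unprocessed error text
    if mode == "reflection":
        return [reflect(attempt, feedback)]  # model-written reflection
    raise ValueError(f"unknown mode: {mode}")


def trials_to_solve(mode: str,
                    act: Callable[[str, List[str]], str],
                    evaluate: Callable[[str], Tuple[bool, str]],
                    reflect: Callable[[str, str], str],
                    task: str,
                    max_trials: int) -> Optional[int]:
    """Same actor, evaluator, and trial budget; only the memory policy varies."""
    memory: List[str] = []
    for trial in range(1, max_trials + 1):
        attempt = act(task, memory)
        ok, feedback = evaluate(attempt)
        if ok:
            return trial
        memory.extend(memory_entry(mode, attempt, feedback, reflect))
    return None  # budget exhausted


# Toy stand-ins (assumptions, for illustration only).
def toy_act(task: str, memory: List[str]) -> str:
    return "fixed" if memory else "buggy"


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    return attempt == "fixed", "AssertionError: expected 5, got 4"


def toy_reflect(attempt: str, feedback: str) -> str:
    return f"'{attempt}' failed ({feedback}); recheck the loop bound."


results: Dict[str, Optional[int]] = {
    mode: trials_to_solve(mode, toy_act, toy_evaluate, toy_reflect, "sum to n", 3)
    for mode in MODES
}
print(results)
```

Holding the actor, evaluator, prompt template, and budget fixed across the three modes is what lets any difference be attributed to the reflection step rather than to extra attempts.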
minor comments (2)
- [Abstract] The abstract states that Reflexion 'obtains significant improvements over a baseline agent' but does not quantify the gains for the non-HumanEval tasks; adding one or two concrete numbers would strengthen the summary.
- [Results tables] Tables reporting pass@1 or success rates should include the number of independent runs and standard deviations, given the stochasticity of LLM sampling.
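As a sketch of the reporting the second minor comment asks for, the snippet below summarizes per-run success rates as a mean and sample standard deviation. The function name `summarize_runs` is hypothetical and the rates are placeholder values chosen for illustration; they are not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run success rates."""
    if len(rates) < 2:
        raise ValueError("need at least two independent runs")
    return mean(rates), stdev(rates)


# Placeholder values only, NOT numbers reported by the paper.
placeholder_rates = [0.50, 0.60, 0.70]
avg, sd = summarize_runs(placeholder_rates)
print(f"{avg:.3f} +/- {sd:.3f} (n={len(placeholder_rates)})")
```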
Simulated Authors' Rebuttal
We thank the referee for their constructive comments, which help clarify the contributions and improve the reproducibility of our work. We address each major point below and commit to revisions where appropriate.
Point-by-point responses
- Referee: [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
  Authors: We agree this control is valuable for isolating the reflection mechanism. Our baseline agent receives execution feedback but does not generate verbal reflections; however, we did not explicitly test direct appending of raw feedback traces without any reflection step. In the revised manuscript we will add this exact control using GPT-4, the same trial budget, and identical prompt templates except for the absence of reflection generation. This will directly address whether verbalization is load-bearing.
  Revision: yes
- Referee: [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.
  Authors: We acknowledge the gap. Our existing ablations examine feedback types and incorporation strategies, yet they do not include a pure raw-feedback storage baseline. We will add this comparison in the revised Section 4, reporting performance when the episodic buffer stores and re-injects raw execution traces without LLM-generated reflections. This will provide direct evidence on the necessity of the verbal reflection step.
  Revision: yes
- Referee: [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
  Authors: We will expand Section 3 with additional detail and pseudocode. The revised text will specify: (1) the buffer stores up to k most recent reflections (k=3 in our experiments), (2) retrieval concatenates all retained reflections in reverse chronological order without summarization, (3) truncation occurs only if total tokens exceed the model context limit by dropping oldest entries first, and (4) the prompt template explicitly places the memory buffer before the current task description. These clarifications will improve reproducibility.
  Revision: yes
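The four buffer rules listed in this last response can be sketched as follows. This is an illustration under assumptions, not the authors' pseudocode: the class and function names are hypothetical, the prompt wording is invented, and whitespace splitting stands in for a real tokenizer.

```python
from collections import deque
from typing import Callable, Deque, List


class ReflectionBuffer:
    """Keeps at most k reflections; the oldest is evicted when a new one arrives."""

    def __init__(self, k: int = 3) -> None:
        self._items: Deque[str] = deque(maxlen=k)

    def add(self, reflection: str) -> None:
        self._items.append(reflection)

    def newest_first(self) -> List[str]:
        # Reverse chronological order, no summarization.
        return list(reversed(self._items))


def whitespace_token_count(text: str) -> int:
    # Crude stand-in for a model tokenizer (assumption).
    return len(text.split())


def render(reflections: List[str], task: str) -> str:
    # The memory block is placed before the task description.
    if not reflections:
        return f"Task:\n{task}"
    memory = "\n".join(f"- {r}" for r in reflections)
    return f"Reflections from earlier trials:\n{memory}\n\nTask:\n{task}"


def build_prompt(buffer: ReflectionBuffer, task: str, max_tokens: int,
                 count_tokens: Callable[[str], int] = whitespace_token_count) -> str:
    kept = buffer.newest_first()
    # Truncate only when over budget, dropping the oldest entry first.
    while kept and count_tokens(render(kept, task)) > max_tokens:
        kept.pop()  # the list is newest-first, so the last element is the oldest
    return render(kept, task)


buf = ReflectionBuffer(k=3)
for note in ["r1", "r2", "r3", "r4"]:
    buf.add(note)

print(build_prompt(buf, "Write add(a, b).", max_tokens=100))
```

Using `deque(maxlen=k)` handles rule (1) without explicit bookkeeping, while the token-budget loop in `build_prompt` covers rule (3) separately, since the two limits can bind independently.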
Circularity Check
No circularity: empirical benchmark gains from verbal reflection framework
Full rationale
The paper proposes Reflexion as an empirical framework where LLMs generate verbal reflections on task feedback, store them in episodic memory, and condition future generations on that text to improve performance without weight updates. The central results are direct pass@1 accuracy comparisons on benchmarks such as HumanEval (91% vs. prior 80% GPT-4), with ablations on feedback types and incorporation methods. No equations, fitted parameters, or self-referential definitions appear in the derivation; the reported improvements are measured outcomes from iterative prompting experiments rather than quantities forced by construction from the inputs. The method's effectiveness is presented as an empirical finding open to external validation, with no load-bearing self-citations or ansatzes that collapse the claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can generate meaningful verbal reflections from task feedback that improve subsequent decisions when stored and retrieved.
invented entities (1)
- Episodic memory buffer storing reflective text (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
- Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
  A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...
- ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
  ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
  Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
- WebArena: A Realistic Web Environment for Building Autonomous Agents
  WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
- Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
  IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
- Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
  Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...
- Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
  Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
- HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection
  HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false posi...
- A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
  Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
- Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
  Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
- DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
  DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
- ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
  ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.
- Test-Time Hinting for Black-Box Vision-Language Models
  Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
  Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
  PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
  MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
  In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
  MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
- BIM Information Extraction Through LLM-based Adaptive Exploration
  LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
- From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
  AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
  Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
  RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
- Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
  AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
- HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
  HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
- Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
  A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...
- Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
  Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
- Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
  Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
  SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
- Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
  Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
- MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
  MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interv...
- MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
  MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...
- BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
  BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- LETGAMES: An LLM-Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment
  LETGAMES uses LLMs to generate open-world D&D-inspired games with conversational guidance for personalized cognitive training, validated through a new psychology-grounded evaluation protocol showing promise in LLM and...
- MemEvolve: Meta-Evolution of Agent Memory Systems
  MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
  AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
- Large Language Models as Optimizers
  Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
- Voyager: An Open-Ended Embodied Agent with Large Language Models
  Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
- What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
  Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
- Reinforcing Human Behavior Simulation via Verbal Feedback
  DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
- PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
  PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-...
- optimize_anything: A Universal API for Optimizing any Text Parameter
  A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
- Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
  TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...
- An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments
  Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
- ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse
  ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.
- The Scaling Laws of Skills in LLM Agent Systems
  Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...
- Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering
  NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
- Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
  Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA ...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
Reference graph
Works this paper leans on
-
[1]
Ahn, M., Brohan, A., Brown, N., Chebotar, Y ., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691
work page internal anchor Pith review arXiv 2022
-
[2]
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Brooks, E., Walls, L., Lewis, R. L., and Singh, S. (2022). In-context policy iteration. arXiv preprint arXiv:2210.03821
-
[4]
Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y ., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2022). Multipl-e: A scalable and extensible approach to benchmarking neural code generation
work page 2022
-
[5]
Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., and Chen, W. (2022). Codet: Code generation with generated tests. arXiv preprint arXiv:2207.10397
work page internal anchor Pith review arXiv 2022
-
[6]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128
work page internal anchor Pith review arXiv 2023
-
[8]
Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. (2019). Textworld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July...
work page 2019
-
[9]
Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgood- man.substack.com
work page 2023
-
[10]
Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491
-
[11]
Lam, W., Winter, S., Wei, A., Xie, T., Marinov, D., and Bell, J. (2020). A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA)
-
[12]
Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. (2022). Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328
-
[13]
Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161
-
[14]
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097
-
[15]
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2023). Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651
- [16]
-
[17]
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332
- [18]
-
[19]
Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442
- [20]
-
[21]
Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495
-
[22]
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761
-
[23]
Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580
-
[24]
Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proceedings of the International Conference on Learning Representations (ICLR)
-
[25]
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition
-
[26]
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903
- [27]
-
[28]
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP)
-
[29]
Yao, S., Chen, H., Yang, J., and Narasimhan, K. (preprint). Webshop: Towards scalable real-world web interaction with grounded language agents. In ArXiv
-
[30]
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR)
-
[31]
Yoran, O., Wolfson, T., Bogin, B., Katz, U., Deutch, D., and Berant, J. (2023). Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007
A Evaluation with additional models
We further investigated the applicability of trial-and-error problem-solving with models of various strengths. We found that the abili...