Reflexion: Language Agents with Verbal Reinforcement Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 13:47 UTC · model grok-4.3
The pith
Language agents can learn from trial and error by storing their own verbal reflections on feedback instead of retraining model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. The framework accepts scalar values or free-form language from either external sources or the agent itself, and produces large gains over a baseline agent on sequential decision-making, coding, and language reasoning tasks.
What carries the argument
An episodic memory buffer that stores the agent's self-generated verbal reflections on past feedback to guide actions in later trials.
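A minimal sketch of that trial loop in Python. The `act`, `evaluate`, and `reflect` callables are hypothetical stand-ins for the paper's LLM calls and feedback sources, and the trial and buffer limits are illustrative defaults rather than the paper's settings.

```python
# Sketch of the Reflexion loop: act, evaluate, verbalize the feedback into a
# reflection, store it, and retry conditioned on the stored text. No weights
# are updated anywhere in this loop.

def run_reflexion(task, act, evaluate, reflect, max_trials=5, memory_size=3):
    """Attempt a task repeatedly, learning only through stored verbal reflections."""
    memory = []  # episodic buffer of self-generated reflection strings
    attempt = None
    for _ in range(max_trials):
        # Condition the next attempt on prior reflections instead of gradients.
        attempt = act(task, reflections=memory)
        feedback, success = evaluate(task, attempt)  # scalar or free-form signal
        if success:
            return attempt
        # Turn the raw feedback into a verbal lesson and append it to memory.
        memory.append(reflect(task, attempt, feedback))
        memory = memory[-memory_size:]  # keep only the most recent reflections
    return attempt
```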
If this is right
- The method yields 91 percent pass@1 accuracy on the HumanEval coding benchmark, above the prior 80 percent mark for GPT-4.
- Performance rises across sequential decision-making, coding, and language reasoning when the reflection buffer is added.
- The same framework handles both numeric and free-form feedback signals from external or internal sources.
- Ablation tests reveal how choice of feedback type and incorporation method changes final accuracy.
Where Pith is reading between the lines
- The results imply that language itself can act as a substitute for gradient updates when an agent must adapt to new outcomes.
- Agents equipped with such a buffer may continue improving over many interactions without any external retraining step.
- The approach could be tested on longer-horizon tasks where memory of past linguistic feedback becomes even more critical.
Load-bearing premise
That reflections written by the same language model will be accurate enough and relevant enough to produce reliably better choices on the next attempt.
What would settle it
Running the same agent with and without the reflection step on a held-out task and finding no consistent gain or even a drop in success rate.
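A minimal sketch of that test, assuming a hypothetical `run_agent` harness that returns True when the agent solves a task within its trial budget:

```python
# A/B comparison of the same agent with and without the reflection step on
# held-out tasks. `run_agent` is a hypothetical harness, not the paper's code.

def reflection_gain(tasks, run_agent, max_trials=5):
    with_reflection = [run_agent(t, reflect=True, max_trials=max_trials) for t in tasks]
    without_reflection = [run_agent(t, reflect=False, max_trials=max_trials) for t in tasks]
    # Mean difference in success rate; values near zero or negative across
    # repeated held-out task draws would count against the core claim.
    return (sum(with_reflection) - sum(without_reflection)) / len(tasks)
```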
Original abstract
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reflexion, a framework in which LLM-based agents generate verbal reflections on task feedback (scalar or linguistic, external or simulated), store the reflections in an episodic memory buffer, and condition future generations on this text to improve performance without any weight updates. It reports substantial gains over baselines across sequential decision-making, coding, and reasoning tasks, with the headline result of 91% pass@1 on HumanEval (vs. a prior state of the art of 80% for GPT-4), and it includes ablations on feedback types, incorporation methods, and agent variants.
Significance. If the performance claims hold after addressing the controls below, the work would be significant for demonstrating that linguistic self-reflection can serve as an efficient, training-free mechanism for agent improvement. The 11-point HumanEval lift is notable for a coding benchmark, and the framework's flexibility with diverse feedback sources could reduce reliance on expensive fine-tuning. The reported ablation and analysis studies already provide some mechanistic insight into component contributions.
Major comments (3)
- [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
- [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements; a concrete version of this missing control is sketched after this list.
- [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
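The first two comments ask for the same missing baseline. A minimal sketch of that control, reusing the hypothetical `act` and `evaluate` callables from the Reflexion sketch earlier: the trial loop is unchanged, and only the reflection step is removed.

```python
# Raw-feedback control: identical trial structure, but the buffer stores raw
# execution feedback (e.g., test-case error text) verbatim, with no
# LLM-generated reflection. Comparing this against run_reflexion isolates
# whether the verbalization step itself drives the gain.

def run_raw_feedback_control(task, act, evaluate, max_trials=5, memory_size=3):
    memory = []
    attempt = None
    for _ in range(max_trials):
        attempt = act(task, reflections=memory)
        feedback, success = evaluate(task, attempt)
        if success:
            return attempt
        memory.append(str(feedback))  # no reflection: store the trace as-is
        memory = memory[-memory_size:]
    return attempt
```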
Minor comments (2)
- [Abstract] The abstract states that Reflexion 'obtains significant improvements over a baseline agent' but does not quantify the gains for the non-HumanEval tasks; adding one or two concrete numbers would strengthen the summary.
- [Results tables] Tables reporting pass@1 or success rates should include the number of independent runs and standard deviations, given the stochasticity of LLM sampling.
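For the run-count and variance reporting requested in the last comment, a minimal sketch; the `outcomes_per_run` shape is an assumption about how per-task results would be logged, not the paper's format.

```python
# Estimate pass@1 as a mean over independent sampled runs, with its standard
# deviation, so tables can report both.
from statistics import mean, stdev

def pass_at_1(outcomes_per_run):
    """outcomes_per_run: list of runs, each a list of 0/1 task outcomes."""
    rates = [mean(run) for run in outcomes_per_run]
    return mean(rates), (stdev(rates) if len(rates) > 1 else 0.0)
```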
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the contributions and improve the reproducibility of our work. We address each major point below and commit to revisions where appropriate.
Point-by-point responses
- Referee: [Experiments section (HumanEval subsection)] HumanEval experiments (main results table and associated text): The 91% pass@1 result is presented as evidence for the verbal-reflection-plus-memory mechanism, yet the paper does not report a control condition in which raw execution feedback (e.g., test-case error messages or compiler traces) is appended directly to the prompt for the same number of trials and the same base LLM. Without this baseline, it remains unclear whether the verbal reflection step itself drives the gain over plain GPT-4 or whether iterative prompting with unprocessed feedback would achieve comparable accuracy.
  Authors: We agree this control is valuable for isolating the reflection mechanism. Our baseline agent receives execution feedback but does not generate verbal reflections; however, we did not explicitly test direct appending of raw feedback traces without any reflection step. In the revised manuscript we will add this exact control using GPT-4, the same trial budget, and identical prompt templates except for the absence of reflection generation. This will directly address whether verbalization is load-bearing. Revision: yes.
- Referee: [Ablation studies] Ablation studies (Section 4 and associated tables): While ablations vary feedback signal type and incorporation method, none directly compare the full Reflexion pipeline against a version that stores and re-uses raw feedback text without an LLM-generated reflection. This omission weakens the central claim that verbal reinforcement learning (as opposed to simple feedback accumulation) is load-bearing for the observed improvements.
  Authors: We acknowledge the gap. Our existing ablations examine feedback types and incorporation strategies, yet they do not include a pure raw-feedback storage baseline. We will add this comparison in the revised Section 4, reporting performance when the episodic buffer stores and re-injects raw execution traces without LLM-generated reflections. This will provide direct evidence on the necessity of the verbal reflection step. Revision: yes.
- Referee: [Method] Method description (Section 3): The precise mechanics of episodic-memory retrieval and prompt construction are underspecified (e.g., whether the entire history is concatenated, whether reflections are summarized or truncated, and how many prior reflections are retained). These details are necessary to assess reproducibility and to understand why the buffer induces better decisions than direct feedback.
  Authors: We will expand Section 3 with additional detail and pseudocode. The revised text will specify: (1) the buffer stores up to k most recent reflections (k=3 in our experiments), (2) retrieval concatenates all retained reflections in reverse chronological order without summarization, (3) truncation occurs only if total tokens exceed the model context limit by dropping oldest entries first, and (4) the prompt template explicitly places the memory buffer before the current task description. These clarifications will improve reproducibility. Revision: yes.
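A minimal sketch of the retrieval and prompt construction described in points (1)-(4). The template wording and the character-count token proxy are assumptions, not the paper's implementation.

```python
# Keep the k most recent reflections, concatenate newest-first without
# summarization, drop the oldest retained entry first if the context limit is
# exceeded, and place the memory buffer before the task description.

def build_prompt(task, reflections, k=3, max_tokens=8192, count_tokens=len):
    retained = reflections[-k:][::-1]  # newest first, no summarization

    def render(entries):
        memory = "\n".join(f"Reflection: {r}" for r in entries)
        return f"{memory}\nTask: {task}"  # memory precedes the task description

    prompt = render(retained)
    while retained and count_tokens(prompt) > max_tokens:
        retained = retained[:-1]  # after reversal, the oldest entry is last
        prompt = render(retained)
    return prompt
```

Here `count_tokens=len` counts characters as a stand-in; a real implementation would substitute the model's tokenizer.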
Circularity Check
No circularity: empirical benchmark gains from verbal reflection framework
Full rationale
The paper proposes Reflexion as an empirical framework where LLMs generate verbal reflections on task feedback, store them in episodic memory, and condition future generations on that text to improve performance without weight updates. The central results are direct pass@1 accuracy comparisons on benchmarks such as HumanEval (91% vs. prior 80% GPT-4), with ablations on feedback types and incorporation methods. No equations, fitted parameters, or self-referential definitions appear in the derivation; the reported improvements are measured outcomes from iterative prompting experiments rather than quantities forced by construction from the inputs. The method's effectiveness is presented as an empirical finding open to external validation, with no load-bearing self-citations or ansatzes that collapse the claim.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large language models can generate meaningful verbal reflections from task feedback that improve subsequent decisions when stored and retrieved.
Invented entities (1)
- Episodic memory buffer storing reflective text (no independent evidence).
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Linked passage: "Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
- ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
  ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
  DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
- WebArena: A Realistic Web Environment for Building Autonomous Agents
  WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
- ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
  ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
  Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
  PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
  MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
  In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
  MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
- BIM Information Extraction Through LLM-based Adaptive Exploration
  LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
- From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
  AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
- Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
  Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
  OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
  RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
- Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
  AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
- HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
  HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
- Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
  A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...
- Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
  Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
- Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
  Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes ac...
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
  SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
- Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
  A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
- Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
  Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
- MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
  MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...
- BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
  BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
  AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
- Large Language Models as Optimizers
  Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
- Voyager: An Open-Ended Embodied Agent with Large Language Models
  Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
  LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
- gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
  LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
- PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
  PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
  EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
- SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
  SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- LoopTrap: Termination Poisoning Attacks on LLM Agents
  LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
- Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
  HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
- Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
  Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
- The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
  Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
  RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
- InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
  InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
- Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
  A dedicated reviewer agent supplies inference-time feedback on provisional tool calls, yielding gains on BFCL and Tau2-Bench while quantifying helpfulness versus harmfulness tradeoffs.
- When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
  A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
- Thinking with Reasoning Skills: Fewer Tokens, More Accuracy
  Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.
- You Don't Need Public Tests to Generate Correct Code
  DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or ...
- Job Skill Extraction via LLM-Centric Multi-Module Framework
  SRICL combines semantic retrieval from ESCO, in-context learning, fine-tuning, and output verification to achieve higher STRICT-F1 scores and fewer invalid or hallucinated skill spans than GPT-3.5 baselines on six pub...
- HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
  HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
  ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
- QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
  QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
  A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
  AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
Reference graph
Works this paper leans on
- [1] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- [2] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. (2021). Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- [3] Brooks, E., Walls, L., Lewis, R. L., and Singh, S. (2022). In-context policy iteration. arXiv preprint arXiv:2210.03821.
- [4] Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M.-H., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., and Jangda, A. (2022). MultiPL-E: A scalable and extensible approach to benchmarking neural code generation.
- [5]
- [6] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [7] Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.
- [8] Côté, M.-A., Kádár, A., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. (2019). TextWorld: A learning environment for text-based games. In Computer Games: 7th Workshop, CGW 2018, Held in Conjunction with the 27th International Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, July...
- [9] Goodman, N. (2023). Meta-prompt: A simple self-improving language agent. noahgoodman.substack.com.
- [10]
- [11] Lam, W., Winter, S., Wei, A., Xie, T., Marinov, D., and Bell, J. (2020). A large-scale longitudinal study of flaky tests. Proc. ACM Program. Lang., 4(OOPSLA).
- [12] Le, H., Wang, Y., Gotmare, A. D., Savarese, S., and Hoi, S. C. H. (2022). CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328.
- [13] Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161.
- [14] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
- [15] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. (2023). Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
- [16]
- [17] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- [18]
- [19] Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.
- [20]
- [21] Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. (2023). Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495.
- [22] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- [23] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint arXiv:2303.17580.
- [24] Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. (2021). ALFWorld: Aligning text and embodied environments for interactive learning. In Proceedings of the International Conference on Learning Representations (ICLR).
- [25] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press, second edition.
- [26] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- [27]
- [28] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [29] Yao, S., Chen, H., Yang, J., and Narasimhan, K. (2022). WebShop: Towards scalable real-world web interaction with grounded language agents. arXiv preprint.
- [30] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
- [31] Yoran, O., Wolfson, T., Bogin, B., Katz, U., Deutch, D., and Berant, J. (2023). Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007.