super hub Mixed citations

Reflexion: Language Agents with Verbal Reinforcement Learning

Ashwin Gopinath, Edward Berman, Federico Cassano, Karthik Narasimhan, Noah Shinn, Shunyu Yao · 2023 · cs.AI · arXiv 2303.11366

Mixed citation behavior. Most common role is background (69%).

180 Pith papers citing it

Background 69% of classified citations

open full Pith review browse 180 citing papers more from Ashwin Gopinath arXiv PDF

abstract

Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 37 method 7 baseline 2 dataset 2

citation-polarity summary

background 33 use method 7 support 3 baseline 2 use dataset 2 unclear 1

claims ledger

abstract Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain the

authors

Ashwin Gopinath Edward Berman Federico Cassano Karthik Narasimhan Noah Shinn Shunyu Yao

co-cited works

representative citing papers

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

cs.CR · 2026-05-24 · unverdicted · novelty 8.0

MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

WebArena: A Realistic Web Environment for Building Autonomous Agents

cs.AI · 2023-07-25 · accept · novelty 8.0

WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.

Generating Robust Portfolios of Optimization Models using Large Language Models

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

An algorithm generates a portfolio of LLM-produced optimization models with guarantees that high-quality candidates are included if either the generator or evaluator aligns with human preferences.

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

citing papers explorer

Showing 50 of 180 citing papers.

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning cs.CR · 2026-05-24 · unverdicted · none · ref 3 · internal anchor
MemMorph poisons LLM agent long-term memory with three crafted records disguised as facts or policies to hijack tool selection, reaching 85.9% success rate across 10 backbones and outperforming baselines while resisting tested defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts cs.CR · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems quant-ph · 2025-10-23 · accept · full · ref 52 · internal anchor
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation cs.CR · 2025-07-14 · unverdicted · none · ref 38 · internal anchor
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 50 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 295 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
WebArena: A Realistic Web Environment for Building Autonomous Agents cs.AI · 2023-07-25 · accept · none · ref 2 · internal anchor
WebArena provides a realistic multi-domain web environment and benchmark where state-of-the-art LLM agents achieve 14.41% end-to-end task success compared to 78.24% for humans.
TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning cs.LG · 2026-06-30 · unverdicted · none · ref 9 · internal anchor
TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models cs.SE · 2026-06-30 · unverdicted · none · ref 37 · internal anchor
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
What Drives Interactive Improvement from Feedback? cs.AI · 2026-06-29 · unverdicted · none · ref 17 · internal anchor
Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 16 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis cs.LG · 2026-05-28 · unverdicted · none · ref 7 · internal anchor
LongDS benchmark shows state-of-the-art agents achieve only 48.45% accuracy on long-horizon data analysis tasks, with performance dropping 47 points from early to late turns and state-maintenance errors causing most failures.
Generating Robust Portfolios of Optimization Models using Large Language Models cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
An algorithm generates a portfolio of LLM-produced optimization models with guarantees that high-quality candidates are included if either the generator or evaluator aligns with human preferences.
MemFail: Stress-Testing Failure Modes of LLM Memory Systems cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems cs.AI · 2026-05-22 · unverdicted · none · ref 44 · internal anchor
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 75 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection cs.CR · 2026-05-20 · unverdicted · none · ref 30 · internal anchor
HIDBench unifies DARPA-E3, DARPA-E5, and NodLink datasets with a data pipeline to benchmark LLMs for host-based intrusion detection, showing high precision on simple logs but sharp drops in MCC and rises in false positives on complex noisy data.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 29 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting cs.LG · 2026-05-19 · unverdicted · none · ref 9 · internal anchor
TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Test-Time Hinting for Black-Box Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 54 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory cs.AI · 2026-05-11 · unverdicted · none · ref 36 · internal anchor
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents cs.MA · 2026-05-05 · unverdicted · none · ref 41 · internal anchor
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates cs.AI · 2026-05-04 · unverdicted · none · ref 12 · internal anchor
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing cs.AI · 2026-05-04 · unverdicted · none · ref 10 · internal anchor
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
BIM Information Extraction Through LLM-based Adaptive Exploration cs.CL · 2026-05-03 · unverdicted · none · ref 46 · internal anchor
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework cs.LG · 2026-05-01 · unverdicted · none · ref 17 · internal anchor
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves cs.SE · 2026-04-29 · unverdicted · none · ref 32 · internal anchor
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 16 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 84 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow cs.SE · 2026-04-24 · unverdicted · none · ref 40 · internal anchor
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery cs.CR · 2026-04-22 · unverdicted · none · ref 36 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 12 · internal anchor
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication cs.AI · 2026-04-21 · unverdicted · none · ref 32 · internal anchor
A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colorado unemployment insurance cases.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 32 · internal anchor
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery cs.CR · 2026-04-21 · unverdicted · none · ref 43 · internal anchor
Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM defect candidates, killing 79-83% before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, C++ standard, and compilers.
AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 40 · internal anchor
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees cs.LG · 2026-04-17 · unverdicted · none · ref 24 · internal anchor
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception cs.AI · 2026-04-06 · unverdicted · none · ref 7 · internal anchor
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.
MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration cond-mat.mtrl-sci · 2026-04-03 · conditional · none · ref 22 · 2 links · internal anchor
MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations cs.NE · 2026-03-30 · unverdicted · none · ref 26 · internal anchor
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
LETGAMES: An LLM-Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment cs.HC · 2026-02-18 · unverdicted · none · ref 6 · internal anchor
LETGAMES uses LLMs to generate open-world D&D-inspired games with conversational guidance for personalized cognitive training, validated through a new psychology-grounded evaluation protocol showing promise in LLM and human expert assessments.
MemEvolve: Meta-Evolution of Agent Memory Systems cs.CL · 2025-12-21 · unverdicted · none · ref 19 · internal anchor
MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents cs.AI · 2024-05-23 · accept · none · ref 24 · internal anchor
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
Large Language Models as Optimizers cs.LG · 2023-09-07 · unverdicted · none · ref 34 · internal anchor
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 30 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents cs.AI · 2026-06-30 · unverdicted · none · ref 22 · internal anchor
ReGRPO augments group-relative policy optimization with a reflective data engine that generates ErrorType-Evidence-FixPlan triplets from near-miss tool actions to improve recovery in multimodal agents.
Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code cs.SE · 2026-06-28 · unverdicted · none · ref 21 · internal anchor
Mandatory per-line citations in SDD frameworks reduce LLM output determinism but enable reliable automated hallucination detection (TDR 86-88%, FPR 0%), a trade-off replicated across Claude and GLM models.

Reflexion: Language Agents with Verbal Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer