Large Language Models as Optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou · 2023 · cs.LG · arXiv 2309.03409

Canonical reference. 80% of citing Pith papers cite this work as background.

52 Pith papers citing it

Background 80% of classified citations

abstract

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro.
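
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): `mock_llm` is a hypothetical stand-in for a real LLM call, the objective is a toy one-dimensional function, and the names `build_meta_prompt` and `opro_loop`, the `top_k` cutoff, and the best-last ordering are choices made for this sketch.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (solution, score) pairs


def build_meta_prompt(task: str, trajectory: Trajectory, top_k: int = 5) -> str:
    """Describe the task and list the top_k scored solutions, best last."""
    best = sorted(trajectory, key=lambda pair: pair[1])[-top_k:]
    lines = [task, "Previous solutions and their scores:"]
    lines += [f"x={x:.4f} score={s:.4f}" for x, s in best]
    lines.append("Propose a new x with a higher score.")
    return "\n".join(lines)


def mock_llm(prompt: str, rng: random.Random, n: int = 4) -> List[float]:
    """Hypothetical stand-in for an LLM: perturbs the best x found in the prompt."""
    scored = [ln for ln in prompt.splitlines() if ln.startswith("x=")]
    best_x = float(scored[-1].split()[0][2:])  # last listed solution is the best
    return [best_x + rng.gauss(0.0, 0.5) for _ in range(n)]


def opro_loop(
    score: Callable[[float], float],
    propose: Callable[[str, random.Random], List[float]],
    task: str,
    init: List[float],
    steps: int = 20,
    seed: int = 0,
) -> Tuple[float, float]:
    """Generate from the prompt, evaluate, add results to the next prompt."""
    rng = random.Random(seed)
    trajectory: Trajectory = [(x, score(x)) for x in init]
    for _ in range(steps):
        prompt = build_meta_prompt(task, trajectory)
        for x in propose(prompt, rng):
            trajectory.append((x, score(x)))  # evaluated solutions feed the next step
    return max(trajectory, key=lambda pair: pair[1])


if __name__ == "__main__":
    best_x, best_score = opro_loop(
        score=lambda x: -(x - 3.0) ** 2,  # toy objective, maximized at x = 3
        propose=mock_llm,
        task="Maximize the score of x.",
        init=[0.0],
    )
    print(best_x, best_score)
```

In a real setting `propose` would send the meta-prompt to an LLM and parse candidate solutions from its reply, and `score` would be the task metric, for example accuracy of a candidate instruction on a training split.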

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

cs.CL · 2026-06-28 · unverdicted · novelty 7.0

A meta-agent uses failure analysis to evolve a task agent's instructions for coordinating lexical, semantic, and multimodal retrievers, leading to up to 19.6 point gains on document QA benchmarks.

Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

An LLM-driven agent with built-in seed-noise audits develops control policies for two aerospace problems that outperform undirected search and pass verification checks.

Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

cs.AI · 2026-06-09 · conditional · novelty 7.0

Open LLMs function as structural priors for MIMO controller tuning by proposing asymmetric structures on coupled plants, reaching better penalized cost with fewer evaluations than pure optimization or classical methods.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift within 24 hours.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

SoftSkill: Behavioral Compression for Contextual Adaptation

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding 5.2-12.5 point gains over SkillOpt on SearchQA and LiveMath while using far fewer tokens.

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Gemini 3.1-Pro with Ukrainian minimal-edits + few-shot prompting reaches F0.5=69.22 on Ukrainian GEC, closing over 90% of the gap to fine-tuned SOTA at 73.14.

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

An LLM-based self-evolving agent discovers a traveling-wave controller with body-frame guidance and yaw feedback that generalizes to unseen targets for an underactuated fluid swimmer.

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

HiSME is a lightweight hierarchical meta-evolving approach that learns meta-skills from traces to refine both skills and evolving strategies, producing higher-quality skill libraries than pure skill evolving on agent benchmarks.

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

DEI shows a heterogeneous four-LLM ensemble achieving 124% higher QD-Score and 28% higher coverage than single-model baselines on Core War at equal compute budget.

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

Training-free prompt optimization methods, including five new education-focused ones, surpass the strongest RL-trained baseline across five conditions on two OOD suites while showing distinct teaching behavior patterns.

optimize_anything: A Universal API for Optimizing any Text Parameter

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

cs.CL · 2026-05-15 · conditional · novelty 6.0

NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

cs.CL · 2026-05-04 · conditional · novelty 6.0

AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.

citing papers explorer

Showing 44 of 44 citing papers after filters.

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents cs.CL · 2026-06-28 · unverdicted · none · ref 16 · internal anchor
A meta-agent uses failure analysis to evolve a task agent's instructions for coordinating lexical, semantic, and multimodal retrievers, leading to up to 19.6 point gains on document QA benchmarks.
Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems cs.RO · 2026-06-18 · unverdicted · none · ref 6 · internal anchor
An LLM-driven agent with built-in seed-noise audits develops control policies for two aerospace problems that outperform undirected search and pass verification checks.
Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning cs.AI · 2026-06-09 · conditional · none · ref 3 · internal anchor
Open LLMs function as structural priors for MIMO controller tuning by proposing asymmetric structures on coupled plants, reaching better penalized cost with fewer evaluations than pure optimization or classical methods.
DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination cs.LG · 2026-06-06 · unverdicted · none · ref 78 · internal anchor
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI cs.AI · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift within 24 hours.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 66 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments cs.SE · 2026-05-04 · unverdicted · none · ref 29 · internal anchor
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery cs.CR · 2026-04-22 · unverdicted · none · ref 42 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
SoftSkill: Behavioral Compression for Contextual Adaptation cs.AI · 2026-06-18 · unverdicted · none · ref 19 · internal anchor
SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding 5.2-12.5 point gains over SkillOpt on SearchQA and LiveMath while using far fewer tokens.
Language-Instructed Vision Embeddings for Controllable and Generalizable Perception cs.CV · 2026-06-17 · unverdicted · none · ref 27 · internal anchor
LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.
How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction? cs.CL · 2026-06-08 · unverdicted · none · ref 3 · internal anchor
Gemini 3.1-Pro with Ukrainian minimal-edits + few-shot prompting reaches F0.5=69.22 on Ukrainian GEC, closing over 90% of the gap to fine-tuned SOTA at 73.14.
From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory cs.CL · 2026-06-07 · unverdicted · none · ref 25 · internal anchor
MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.
Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control cs.AI · 2026-06-07 · unverdicted · none · ref 14 · internal anchor
An LLM-based self-evolving agent discovers a traveling-wave controller with body-frame guidance and yaw feedback that generalizes to unseen targets for an underactuated fluid swimmer.
You Live More Than Once: Towards Hierarchical Skill Meta-Evolving cs.AI · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
HiSME is a lightweight hierarchical meta-evolving approach that learns meta-skills from traces to refine both skills and evolving strategies, producing higher-quality skill libraries than pure skill evolving on agent benchmarks.
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search cs.LG · 2026-05-26 · unverdicted · none · ref 26 · internal anchor
DEI shows a heterogeneous four-LLM ensemble achieving 124% higher QD-Score and 28% higher coverage than single-model baselines on Core War at equal compute budget.
LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring cs.CL · 2026-05-26 · unverdicted · none · ref 20 · internal anchor
Training-free prompt optimization methods, including five new education-focused ones, surpass the strongest RL-trained baseline across five conditions on two OOD suites while showing distinct teaching behavior patterns.
optimize_anything: A Universal API for Optimizing any Text Parameter cs.CL · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering cs.CL · 2026-05-15 · conditional · none · ref 35 · internal anchor
NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval cs.AI · 2026-05-04 · unverdicted · none · ref 50 · 2 links · internal anchor
FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models cs.CL · 2026-05-04 · conditional · none · ref 27 · internal anchor
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis cs.AI · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization cs.AI · 2026-04-14 · unverdicted · none · ref 26 · internal anchor
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limited success.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 99 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
Reflective Context Learning: Studying the Optimization Primitives of Context Space cs.LG · 2026-04-03 · unverdicted · none · ref 12 · internal anchor
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene
Self-Optimizing Multi-Agent Systems for Deep Research cs.IR · 2026-04-03 · unverdicted · none · ref 18 · internal anchor
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
Two-Stage Prompt Optimization for Few-Shot Relation Extraction: From Reasoning-Guided Search to Gradient-Guided Refinement cs.CL · 2026-06-28 · unverdicted · none · ref 3 · internal anchor
A two-stage prompt optimization framework combining reasoning-guided search with gradient-guided refinement via GradPO reaches state-of-the-art on FS-TACRED using Qwen3-4B.
Narration-of-Thought: Inference-Time Scaffolding for Defeasible Ethical Reasoning in Large Language Models cs.AI · 2026-06-24 · conditional · none · ref 63 · internal anchor
Narration-of-thought prompting reduces stakeholder collapse from up to 31% to under 1% and uncertainty suppression from up to 72% to 1-24% across four LLM generators on 100 DailyDilemmas scenarios.
AlgoEvolve: LLM-driven Meta-evolution of Algorithmic Trading Programs cs.AI · 2026-06-24 · unverdicted · none · ref 32 · internal anchor
An LLM-driven evolutionary framework generates executable trading strategies as Python code and uses a meta-loop to evolve the prompts that guide synthesis.
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution cs.LG · 2026-06-18 · unverdicted · none · ref 1 · internal anchor
MAA formalizes alignability and comparability conditions and uses differential signals, EMA accumulation, and semantic identity merging to enable cross-batch operation-level evidence accumulation, outperforming batch-level baselines in 14 of 16 settings while matching online methods.
VirtualMLE: A Virtual ML Engineer that Optimizes Sequential Recommenders cs.IR · 2026-06-02 · unverdicted · none · ref 13 · internal anchor
VirtualMLE deploys an LLM agent with execution-reflection-memory to tune sequential recommenders, reaching competitive quality on Amazon benchmarks with fewer trials and transferring heuristics across datasets.
Exploring Autonomous Agentic Data Engineering for Model Specialization cs.CL · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
LLMs functioning as autonomous agents can curate and optimize training data end-to-end, yielding up to 57.29% performance gains on specialized tasks via iterative adaptation guided by post-training metrics.
Prompt Optimization for LLM Code Generation via Reinforcement Learning cs.SE · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast cs.AI · 2026-05-15 · unverdicted · none · ref 22 · internal anchor
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience cs.AI · 2026-05-14 · unverdicted · none · ref 1 · internal anchor
Iterative distillation of experience trains prompting policies that boost black-box LLM performance on reasoning and tool-use tasks from 55-74% to 90-91%.
Evolutionary Ensemble of Agents cs.NE · 2026-05-09 · unverdicted · none · ref 17 · 2 links · internal anchor
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
A Control Architecture for Training-Free Memory Use cs.AI · 2026-04-20 · unverdicted · none · ref 28 · internal anchor
A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis cs.AI · 2026-04-12 · unverdicted · none · ref 31 · internal anchor
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
AI Training Manager: Bounded Closed-Loop Control of Adaptive Training Recipes cs.AI · 2026-06-29 · unverdicted · none · ref 18 · internal anchor
An LLM-based bounded controller adapts ML training parameters from structured telemetry to correct overfitting and exploration issues, shown on TinyStories and robotic RL tasks.
Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring cs.AI · 2026-06-18 · unverdicted · none · ref 3 · internal anchor
An adaptive prompt router trained in simulation and deployed with high-school students improves exercise conversion to 28.1% and cuts conversation length by about 3 turns compared with static baselines.
Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate cs.AI · 2026-05-17 · unverdicted · none · ref 99 · internal anchor
TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction cs.CY · 2026-04-09 · unverdicted · none · ref 35 · internal anchor
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs cs.CL · 2026-05-11 · unverdicted · none · ref 13 · internal anchor
A DSPy-based per-stage prompt optimization pipeline with self-consistency achieves second place among full participants in the ArchEHR-QA 2026 EHR QA shared task.
The Hitchhiker's Guide to Agentic AI: From Foundations to Systems cs.AI · 2026-06-22 · unverdicted · none · ref 143 · internal anchor
A comprehensive reference book organizing existing techniques for agentic AI systems across LLM substrate, reasoning, agent design patterns, inter-agent coordination, and production deployment.
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems cs.AI · 2026-04-16 · unreviewed · ref 11 · internal anchor
