Pith · machine review for the scientific record

arxiv: 2309.08532 · v3 · submitted 2023-09-15 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 06:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords prompt optimization · evolutionary algorithms · large language models · automatic prompt engineering · discrete optimization · BIG-Bench Hard

The pith

EvoPrompt uses LLMs as evolutionary operators to automatically refine prompts and beat human designs by up to 25 percent on hard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoPrompt, a gradient-free method that treats prompt engineering as an evolutionary process. It initializes a population of natural-language prompts and iteratively applies crossover and mutation by querying LLMs to produce new candidate prompts, then keeps the best performers according to a development set. Experiments across 31 datasets show consistent gains over both hand-crafted prompts and prior automatic methods for both closed-source models like GPT-3.5 and open-source ones like Alpaca, with the largest lift reaching 25 percent on BIG-Bench Hard tasks. The approach demonstrates that LLMs can supply the language-generation step while evolutionary selection supplies the optimization pressure.

Core claim

EvoPrompt connects large language models to evolutionary algorithms so that the models themselves implement the variation operators on discrete prompt strings. Starting from an initial population, the method repeatedly asks the LLM to recombine or mutate existing prompts, evaluates the offspring on a held-out development set, and retains the stronger performers, thereby raising task accuracy without any parameter updates or gradient signals.
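
Read as pseudocode, the loop is compact. The sketch below is a minimal Python rendering of that description; the function names, random parent pairing, and truncation-style survivor selection are illustrative assumptions of this review, not the paper's exact GA and DE instantiations.

```python
import random

def evolve_prompts(seed_prompts, llm_generate, make_operator_prompt,
                   score_on_dev_set, generations=10, population_size=10):
    """Minimal EvoPrompt-style loop: the LLM supplies variation, the dev set supplies selection."""
    population = [(p, score_on_dev_set(p)) for p in seed_prompts]
    for _ in range(generations):
        for _ in range(population_size):
            # Variation: ask the LLM to recombine and mutate two existing prompts.
            (p1, _), (p2, _) = random.sample(population, 2)
            child = llm_generate(make_operator_prompt(p1, p2))
            population.append((child, score_on_dev_set(child)))
        # Selection: keep the strongest prompts for the next generation.
        population = sorted(population, key=lambda pair: pair[1],
                            reverse=True)[:population_size]
    return max(population, key=lambda pair: pair[1])  # best prompt and its dev-set score
```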

What carries the argument

LLM-implemented evolutionary operators (crossover and mutation) that take existing prompt strings as input and output new coherent prompt strings for the next generation.
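
The paper supplies explicit templates for these operators; the string below is only a hedged paraphrase of what such an instruction could look like (placeholder wording, not the authors' template), and could serve as the `make_operator_prompt` output in the sketch above.

```python
# Hypothetical paraphrase of an LLM-as-operator instruction; the paper's own
# GA/DE templates should be consulted for the real wording.
OPERATOR_INSTRUCTION = """\
1. Cross over the two parent prompts: identify where they differ and combine
   the stronger parts of each into one candidate prompt.
2. Mutate the candidate: change a small part of its wording while keeping it
   coherent, human-readable, and faithful to the task.
Parent prompt 1: {parent_1}
Parent prompt 2: {parent_2}
Return only the final new prompt.
"""
```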

If this is right

  • Prompts for any new task can be improved automatically from a small seed set without human rewriting.
  • The same evolutionary loop works unchanged for both API-only and locally runnable LLMs.
  • Performance gains appear on both understanding and generation tasks as well as on the hardest subset of BIG-Bench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same operator pattern could be applied to other discrete artifacts such as code snippets or molecular strings if suitable fitness functions are defined.
  • Iterating the evolutionary process inside an agent loop might allow models to self-improve their own instruction following over multiple rounds.
  • Because the method needs only a development set for selection, it offers a practical route for domains where labeled test data are scarce but a small validation split exists.

Load-bearing premise

LLMs can repeatedly generate coherent, human-readable prompts as evolutionary operators without introducing inconsistencies or quality drift that would stall improvement.

What would settle it

Run the method for ten generations on a new reasoning task and measure whether average prompt coherence (by human rating or lexical diversity) drops below the starting population while accuracy fails to rise.
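
As one concrete way to operationalize the lexical-diversity half of that check, a distinct-unigram ratio per generation could be tracked; the proxy below is an assumption of this review, not a metric the paper defines.

```python
# Sketch of the lexical-diversity half of the proposed test: flag the failure
# mode where wording collapses below the seed population while dev-set
# accuracy does not improve.
def distinct_1(prompt: str) -> float:
    tokens = prompt.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def generation_diversity(prompts: list[str]) -> float:
    return sum(distinct_1(p) for p in prompts) / len(prompts)

def stalled(seed_prompts, later_prompts, seed_acc, later_acc) -> bool:
    return (generation_diversity(later_prompts) < generation_diversity(seed_prompts)
            and later_acc <= seed_acc)
```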

read the original abstract

Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces EvoPrompt, a framework that connects large language models with evolutionary algorithms for discrete prompt optimization. It initializes a population of prompts and iteratively applies LLM-based crossover and mutation operators (with provided templates) to generate new candidates, selecting improvements based on development-set performance. Experiments across 31 datasets covering language understanding, generation, and BIG-Bench Hard tasks report that EvoPrompt outperforms human-engineered prompts and prior automatic methods such as APE, with gains reaching up to 25% on BBH for models including GPT-3.5 and Alpaca.

Significance. If the empirical results hold, the work establishes a practical synergy between LLMs and conventional evolutionary algorithms for automating prompt engineering without gradients or parameters. The explicit provision of the LLM operator templates is a clear strength that supports reproducibility and invites follow-on research on hybrid LLM-algorithm systems.

major comments (1)
  1. [Experiments] Experiments section: the central performance claims (outperformance on 31 datasets and up to 25% on BBH) are presented without reported statistical significance tests, standard deviations or variance across multiple runs, explicit prompt-length or token-budget controls relative to baselines, or details on the exact number of independent trials. These omissions make it difficult to assess the robustness of the reported gains.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to 25% on BBH' would benefit from specifying the exact metric, baseline, and task subset to allow immediate interpretation.
  2. [Method] Method description: while the evolutionary loop is clearly outlined, a short pseudocode block or explicit enumeration of population size, number of generations, and selection mechanism would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental robustness. We address the single major comment point by point below and commit to revisions that directly strengthen the presentation of results.

read point-by-point responses
  1. Referee: Experiments section: the central performance claims (outperformance on 31 datasets and up to 25% on BBH) are presented without reported statistical significance tests, standard deviations or variance across multiple runs, explicit prompt-length or token-budget controls relative to baselines, or details on the exact number of independent trials. These omissions make it difficult to assess the robustness of the reported gains.

    Authors: We agree that these details are necessary for a rigorous assessment of the claims. In the revised manuscript we will rerun the key experiments (including the BBH suite and representative subsets of the 31 datasets) across at least five independent trials with different random seeds, reporting mean performance together with standard deviations. We will add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions are violated) to establish statistical significance of the reported gains over baselines. We will also insert an explicit analysis of prompt length and token usage, ensuring that EvoPrompt-generated prompts are compared against baselines under comparable length/token budgets; any residual differences will be noted and discussed. The exact number of trials and the random-seed protocol will be stated clearly in the Experiments section. revision: yes
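
For concreteness, the promised robustness analysis could be run along the lines of the sketch below (SciPy-based; the Shapiro-Wilk check standing in for the normality assessment is an illustrative choice by this review, not code from the paper or the rebuttal).

```python
# Sketch: paired per-dataset scores for EvoPrompt vs. a baseline across seeds,
# reported as mean/std plus a paired significance test.
import numpy as np
from scipy import stats

def compare_runs(evo_scores, base_scores, alpha=0.05):
    evo, base = np.asarray(evo_scores), np.asarray(base_scores)
    diffs = evo - base
    # Shapiro-Wilk on the paired differences decides which test to report.
    normal = stats.shapiro(diffs).pvalue > alpha
    test = stats.ttest_rel(evo, base) if normal else stats.wilcoxon(evo, base)
    return {
        "evo_mean": evo.mean(), "evo_std": evo.std(ddof=1),
        "base_mean": base.mean(), "base_std": base.std(ddof=1),
        "test": "paired t" if normal else "Wilcoxon signed-rank",
        "p_value": test.pvalue,
    }
```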

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript describes an empirical framework (EvoPrompt) that applies LLMs as crossover and mutation operators within an evolutionary loop over discrete prompts, with fitness evaluated on held-out development sets. No equations, first-principles derivations, or parameter-fitting steps are present that would reduce reported performance gains to quantities defined inside the method itself. All central claims rest on external benchmark results across 31 datasets, with explicit prompt templates supplied for the evolutionary operators, enabling independent reproduction. The approach therefore contains no self-definitional, fitted-input, or self-citation-load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the unproven assumption that LLMs can serve as stable evolutionary operators on natural language.

axioms (1)
  • domain assumption: LLMs can perform coherent crossover and mutation on discrete natural-language prompts while preserving readability and task relevance.
    This assumption is required for the evolutionary loop to function without external supervision.

pith-pipeline@v0.9.0 · 5566 in / 1018 out tokens · 98610 ms · 2026-05-16T06:08:22.712798+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  3. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  4. Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

    cs.CR 2026-04 unverdicted novelty 7.0

    DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealt...

  5. Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...

  6. PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

    cs.LG 2026-04 unverdicted novelty 7.0

    PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

  7. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  8. OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation

    cs.AI 2026-05 unverdicted novelty 6.0

    OpenDeepThink improves LLM reasoning by ranking parallel candidate traces via Bradley-Terry aggregation of LLM pairwise judgments, achieving a +405 Codeforces Elo gain on Gemini 3.1 Pro after eight rounds.

  9. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  10. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  11. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  12. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA uses a genetic algorithm to evolve agent seeds and achieves 74.52% human-exceeding performance on tabular AutoML tasks versus 54.15% for the AIDE baseline.

  13. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.

  14. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  15. Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...

  16. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  17. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  18. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  19. Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

    cs.CL 2026-05 unverdicted novelty 5.0

    Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.

  20. Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

    cs.SE 2026-04 accept novelty 4.0

    Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.

Reference graph

Works this paper leans on

153 extracted references · 153 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations

    Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4668--4679, 2020

  2. [2]

    Promptsource: An integrated development environment and repository for natural language prompts

    Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. Promptsource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System De...

  3. [3]

    Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems

    Janez Brest, Sašo Greiner, Borko Bošković, Marjan Mernik, and Viljem Žumer. Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. IEEE transactions on evolutionary computation, 10(6): 646--657, 2006

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901, 2020

  5. [6]

    Introduction to derivative-free optimization

    Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization. SIAM, 2009

  6. [7]

    Differential evolution: A survey of the state-of-the-art

    Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the state-of-the-art. IEEE transactions on evolutionary computation, 15(1): 4--31, 2010

  7. [8]

    Recent advances in differential evolution--an updated survey

    Swagatam Das, Sankha Subhra Mullick, and Ponnuthurai N Suganthan. Recent advances in differential evolution--an updated survey. Swarm and evolutionary computation, 27: 1--30, 2016

  8. [9]

    Rlprompt: Optimizing discrete text prompts with reinforcement learning

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 3369--3391, 2022

  9. [10]

    Ant colony system: a cooperative learning approach to the traveling salesman problem

    Marco Dorigo and Luca Maria Gambardella. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on evolutionary computation, 1(1): 53--66, 1997

  10. [14]

    John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975. ISBN 0262581116

  11. [15]

    Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence

    John H Holland. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992

  12. [16]

    Mining and summarizing customer reviews

    Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pp.\ 168--177, 2004

  13. [18]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 423--438, 2020

  14. [19]

    Particle swarm optimization

    James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of ICNN'95-international conference on neural networks, volume 4, pp.\ 1942--1948. IEEE, 1995

  15. [20]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 22199--22213, 2022

  16. [23]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, pp.\ 3045--3059, 2021

  17. [25]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 4582--4597, 2021

  18. [26]

    Roulette-wheel selection via stochastic acceptance

    Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic acceptance. Physica A: Statistical Mechanics and its Applications, 391(6): 2193--2196, 2012

  19. [27]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9): 1--35, 2023

  20. [31]

    Genetic algorithm: Theory, literature review, and application in image reconstruction

    Seyedali Mirjalili, Jin Song Dong, Ali Safa Sadiq, and Hossam Faris. Genetic algorithm: Theory, literature review, and application in image reconstruction. Nature-Inspired Optimizers: Theories, Literature Reviews and Applications, pp.\ 69--85, 2020

  21. [32]

    Reframing instructional prompts to gptk’s language

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to gptk’s language. In Findings of the Association for Computational Linguistics: ACL 2022, pp.\ 589--612, 2022 a

  22. [33]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3470--3487, 2022 b

  23. [34]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022 c

  24. [35]

    An introduction to genetic algorithms

    Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998

  25. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022

  26. [39]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Bo PANG. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005

  27. [40]

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts

    Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp.\ 271--278, 2004

  28. [41]

    Differential evolution: A review of more than two decades of research

    Millie Pant, Hira Zaheer, Laura Garcia-Hernandez, Ajith Abraham, et al. Differential evolution: A review of more than two decades of research. Engineering Applications of Artificial Intelligence, 90: 103479, 2020

  29. [43]

    Differential evolution

    Kenneth V Price. Differential evolution. In Handbook of optimization: From classical to modern approach, pp.\ 187--214. Springer, 2013

  30. [46]

    Derivative-free optimization: a review of algorithms and comparison of software implementations

    Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56: 1247--1293, 2013

  31. [48]

    Exploiting cloze-questions for few-shot text classification and natural language inference

    Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255--269, 2021

  32. [51]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 4222--4235, 2020

  33. [52]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp.\ 1631--1642, 2013

  34. [53]

    Differential evolution--a simple and efficient heuristic for global optimization over continuous spaces

    Rainer Storn and Kenneth Price. Differential evolution--a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11: 341--359, 1997

  35. [55]

    Stanford Alpaca: An Instruction-following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  36. [57]

    A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems

    Jakob Vesterstrom and Rene Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Proceedings of the 2004 congress on evolutionary computation (IEEE Cat. No. 04TH8753), volume 2, pp.\ 1980--1987. IEEE, 2004

  37. [58]

    Building a question answering test collection

    Ellen M Voorhees and Dawn M Tice. Building a question answering test collection. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp.\ 200--207, 2000

  38. [59]

    Universal adversarial triggers for attacking and analyzing nlp

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2153--2162, 2019

  39. [61]

    Tournament selection --- Wikipedia , the free encyclopedia

    Wikipedia contributors . Tournament selection --- Wikipedia , the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Tournament_selection&oldid=1160627612, 2023. [Online; accessed 26-September-2023]

  40. [62]

    Optimizing statistical machine translation for text simplification

    Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4: 401--415, 2016

  41. [63]

    Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts

    JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp.\ 1--21, 2023

  42. [64]

    JADE: Adaptive differential evolution with optional external archive

    Jingqiao Zhang and Arthur C. Sanderson. JADE: Adaptive differential evolution with optional external archive. IEEE Transactions on Evolutionary Computation, 13(5): 945--958, 2009. doi:10.1109/TEVC.2009.2014613

  43. [65]

    Differentiable prompt makes pre-trained language models better few-shot learners

    Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. In International Conference on Learning Representations, 2021

  44. [67]

    Tempera: Test-time prompt editing via reinforcement learning

    Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023 a

  45. [69]

    Character-level convolutional networks for text classification

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. NeurIPS, 28, 2015

  46. [72]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022

  47. [74]

    Scaling Learning Algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  48. [75]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527--1554, 2006

  49. [76]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  50. [77]

    Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education

    Clancey, William J. Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83)

  51. [78]

    Classification Problem Solving

    Clancey, William J. Classification Problem Solving. Proceedings of the Fourth National Conference on Artificial Intelligence

  52. [79]

    New Ways to Make Microcircuits Smaller

    Arthur L. Robinson. New Ways to Make Microcircuits Smaller. Science, 1980. https://science.sciencemag.org/content/208/4447/1019.full.pdf

  53. [80]

    New Ways to Make Microcircuits Smaller---Duplicate Entry

    Robinson, Arthur L. New Ways to Make Microcircuits Smaller---Duplicate Entry. Science

  54. [81]

    Strategic explanations for a diagnostic consultation system

    Diane Warner Hasling, William J. Clancey, and Glenn Rennels. Strategic explanations for a diagnostic consultation system. The International Journal of Man-Machine Studies, 1984. doi:10.1016/S0020-7373(84)80003-6

  55. [82]

    Strategic Explanations in Consultation---Duplicate

    Diane Warner Hasling, William J. Clancey, Glenn R. Rennels, and Thomas Test. Strategic Explanations in Consultation---Duplicate. The International Journal of Man-Machine Studies

  56. [83]

    Poligon: A System for Parallel Problem Solving

    Rice, James. Poligon: A System for Parallel Problem Solving

  57. [84]

    Transfer of Rule-Based Expertise through a Tutorial Dialogue

    Clancey, William J. Transfer of Rule-Based Expertise through a Tutorial Dialogue

  58. [85]

    The Engineering of Qualitative Models

    Clancey, William J. The Engineering of Qualitative Models

  59. [86]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

  60. [87]

    Pluto: The 'Other' Red Planet

    NASA. Pluto: The 'Other' Red Planet

  61. [88]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022

  62. [89]

    OPT: Open Pre-trained Transformer Language Models

    OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  63. [90]

    Learning transferable visual models from natural language supervision

    Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

  64. [91]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023

  65. [92]

    Large Language Models are Human-Level Prompt Engineers

    Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations

  66. [93]

    Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts

    Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems

  67. [94]

    RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

    RLPrompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

  68. [95]

    TEMPERA: Test-Time Prompt Editing via Reinforcement Learning

    TEMPERA: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023

  69. [96]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Ves Stoyanov. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. CoRR, arXiv:2212.12017, 2022

  70. [97]

    Transformers: State-of-the-Art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

  71. [98]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

  72. [99]

    A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004

  73. [100]

    Building a question answering test collection

    Building a question answering test collection. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2000

  74. [101]

    DBpedia--a large-scale, multilingual knowledge base extracted from Wikipedia

    DBpedia--a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015

  75. [102]

    Character-level convolutional networks for text classification

    Character-level convolutional networks for text classification. NeurIPS, 28, 2015

  76. [103]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013

  77. [104]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005

  78. [105]

    Mining and summarizing customer reviews

    Mining and summarizing customer reviews. In KDD, 2004

  79. [106]

    BBT v2: Towards a Gradient-Free Future with Large Language Models

    Sun, Tianxiang and He, Zhengfu and Qian, Hong and Zhou, Yunhua and Huang, Xuanjing and Qiu, Xipeng. BBT v2: Towards a Gradient-Free Future with Large Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022

  80. [107]

    Black-box tuning for language-model-as-a-service

    Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 2022

Showing first 80 references.