Pith · machine review for the scientific record

arxiv: 2309.08532 · v3 · submitted 2023-09-15 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 06:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords prompt optimization · evolutionary algorithms · large language models · automatic prompt engineering · discrete optimization · BIG-Bench Hard

The pith

EvoPrompt uses LLMs as evolutionary operators to automatically refine prompts and beat human designs by up to 25 percent on hard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoPrompt, a gradient-free method that treats prompt engineering as an evolutionary process. It initializes a population of natural-language prompts and iteratively applies crossover and mutation by querying LLMs to produce new candidate prompts, then keeps the best performers according to a development set. Experiments across 31 datasets show consistent gains over both hand-crafted prompts and prior automatic methods for both closed-source models like GPT-3.5 and open-source ones like Alpaca, with the largest lift reaching 25 percent on BIG-Bench Hard tasks. The approach demonstrates that LLMs can supply the language-generation step while evolutionary selection supplies the optimization pressure.

Core claim

EvoPrompt connects large language models to evolutionary algorithms so that the models themselves implement the variation operators on discrete prompt strings. Starting from an initial population, the method repeatedly asks the LLM to recombine or mutate existing prompts, evaluates the offspring on a held-out development set, and retains the stronger performers, thereby raising task accuracy without any parameter updates or gradient signals.
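
Read as pseudocode, the loop is compact. The sketch below is a minimal Python rendering of that description; the function names, random parent pairing, and truncation-style survivor selection are illustrative assumptions of this review, not the paper's exact GA and DE instantiations.

```python
import random

def evolve_prompts(seed_prompts, llm_generate, make_operator_prompt,
                   score_on_dev_set, generations=10, population_size=10):
    """Minimal EvoPrompt-style loop: the LLM supplies variation, the dev set supplies selection."""
    population = [(p, score_on_dev_set(p)) for p in seed_prompts]
    for _ in range(generations):
        for _ in range(population_size):
            # Variation: ask the LLM to recombine and mutate two existing prompts.
            (p1, _), (p2, _) = random.sample(population, 2)
            child = llm_generate(make_operator_prompt(p1, p2))
            population.append((child, score_on_dev_set(child)))
        # Selection: keep the strongest prompts for the next generation.
        population = sorted(population, key=lambda pair: pair[1],
                            reverse=True)[:population_size]
    return max(population, key=lambda pair: pair[1])  # best prompt and its dev-set score
```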

What carries the argument

LLM-implemented evolutionary operators (crossover and mutation) that take existing prompt strings as input and output new coherent prompt strings for the next generation.
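
The paper supplies explicit templates for these operators; the string below is only a hedged paraphrase of what such an instruction could look like (placeholder wording, not the authors' template), and could serve as the `make_operator_prompt` output in the sketch above.

```python
# Hypothetical paraphrase of an LLM-as-operator instruction; the paper's own
# GA/DE templates should be consulted for the real wording.
OPERATOR_INSTRUCTION = """\
1. Cross over the two parent prompts: identify where they differ and combine
   the stronger parts of each into one candidate prompt.
2. Mutate the candidate: change a small part of its wording while keeping it
   coherent, human-readable, and faithful to the task.
Parent prompt 1: {parent_1}
Parent prompt 2: {parent_2}
Return only the final new prompt.
"""
```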

If this is right

  • Prompts for any new task can be improved automatically from a small seed set without human rewriting.
  • The same evolutionary loop works unchanged for both API-only and locally runnable LLMs.
  • Performance gains appear on both understanding and generation tasks as well as on the hardest subset of BIG-Bench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same operator pattern could be applied to other discrete artifacts such as code snippets or molecular strings if suitable fitness functions are defined.
  • Iterating the evolutionary process inside an agent loop might allow models to self-improve their own instruction following over multiple rounds.
  • Because the method needs only a development set for selection, it offers a practical route for domains where labeled test data are scarce but a small validation split exists.

Load-bearing premise

LLMs can repeatedly generate coherent, human-readable prompts as evolutionary operators without introducing inconsistencies or quality drift that would stall improvement.

What would settle it

Run the method for ten generations on a new reasoning task and measure whether average prompt coherence (by human rating or lexical diversity) drops below the starting population while accuracy fails to rise.
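
As one concrete way to operationalize the lexical-diversity half of that check, a distinct-unigram ratio per generation could be tracked; the proxy below is an assumption of this review, not a metric the paper defines.

```python
# Sketch of the lexical-diversity half of the proposed test: flag the failure
# mode where wording collapses below the seed population while dev-set
# accuracy does not improve.
def distinct_1(prompt: str) -> float:
    tokens = prompt.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def generation_diversity(prompts: list[str]) -> float:
    return sum(distinct_1(p) for p in prompts) / len(prompts)

def stalled(seed_prompts, later_prompts, seed_acc, later_acc) -> bool:
    return (generation_diversity(later_prompts) < generation_diversity(seed_prompts)
            and later_acc <= seed_acc)
```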

read the original abstract

Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces EvoPrompt, a framework that connects large language models with evolutionary algorithms for discrete prompt optimization. It initializes a population of prompts and iteratively applies LLM-based crossover and mutation operators (with provided templates) to generate new candidates, selecting improvements based on development-set performance. Experiments across 31 datasets covering language understanding, generation, and BIG-Bench Hard tasks report that EvoPrompt outperforms human-engineered prompts and prior automatic methods such as APE, with gains reaching up to 25% on BBH for models including GPT-3.5 and Alpaca.

Significance. If the empirical results hold, the work establishes a practical synergy between LLMs and conventional evolutionary algorithms for automating prompt engineering without gradients or parameters. The explicit provision of the LLM operator templates is a clear strength that supports reproducibility and invites follow-on research on hybrid LLM-algorithm systems.

major comments (1)
  1. [Experiments] Experiments section: the central performance claims (outperformance on 31 datasets and up to 25% on BBH) are presented without reported statistical significance tests, standard deviations or variance across multiple runs, explicit prompt-length or token-budget controls relative to baselines, or details on the exact number of independent trials. These omissions make it difficult to assess the robustness of the reported gains.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to 25% on BBH' would benefit from specifying the exact metric, baseline, and task subset to allow immediate interpretation.
  2. [Method] Method description: while the evolutionary loop is clearly outlined, a short pseudocode block or explicit enumeration of population size, number of generations, and selection mechanism would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental robustness. We address the single major comment point by point below and commit to revisions that directly strengthen the presentation of results.

read point-by-point responses
  1. Referee: Experiments section: the central performance claims (outperformance on 31 datasets and up to 25% on BBH) are presented without reported statistical significance tests, standard deviations or variance across multiple runs, explicit prompt-length or token-budget controls relative to baselines, or details on the exact number of independent trials. These omissions make it difficult to assess the robustness of the reported gains.

    Authors: We agree that these details are necessary for a rigorous assessment of the claims. In the revised manuscript we will rerun the key experiments (including the BBH suite and representative subsets of the 31 datasets) across at least five independent trials with different random seeds, reporting mean performance together with standard deviations. We will add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions are violated) to establish statistical significance of the reported gains over baselines. We will also insert an explicit analysis of prompt length and token usage, ensuring that EvoPrompt-generated prompts are compared against baselines under comparable length/token budgets; any residual differences will be noted and discussed. The exact number of trials and the random-seed protocol will be stated clearly in the Experiments section. revision: yes
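
For concreteness, the promised robustness analysis could be run along the lines of the sketch below (SciPy-based; the Shapiro-Wilk check standing in for the normality assessment is an illustrative choice by this review, not code from the paper or the rebuttal).

```python
# Sketch: paired per-dataset scores for EvoPrompt vs. a baseline across seeds,
# reported as mean/std plus a paired significance test.
import numpy as np
from scipy import stats

def compare_runs(evo_scores, base_scores, alpha=0.05):
    evo, base = np.asarray(evo_scores), np.asarray(base_scores)
    diffs = evo - base
    # Shapiro-Wilk on the paired differences decides which test to report.
    normal = stats.shapiro(diffs).pvalue > alpha
    test = stats.ttest_rel(evo, base) if normal else stats.wilcoxon(evo, base)
    return {
        "evo_mean": evo.mean(), "evo_std": evo.std(ddof=1),
        "base_mean": base.mean(), "base_std": base.std(ddof=1),
        "test": "paired t" if normal else "Wilcoxon signed-rank",
        "p_value": test.pvalue,
    }
```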

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript describes an empirical framework (EvoPrompt) that applies LLMs as crossover and mutation operators within an evolutionary loop over discrete prompts, with fitness evaluated on held-out development sets. No equations, first-principles derivations, or parameter-fitting steps are present that would reduce reported performance gains to quantities defined inside the method itself. All central claims rest on external benchmark results across 31 datasets, with explicit prompt templates supplied for the evolutionary operators, enabling independent reproduction. The approach therefore contains no self-definitional, fitted-input, or self-citation-load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the unproven assumption that LLMs can serve as stable evolutionary operators on natural language.

axioms (1)
  • domain assumption: LLMs can perform coherent crossover and mutation on discrete natural-language prompts while preserving readability and task relevance.
    This assumption is required for the evolutionary loop to function without external supervision.

pith-pipeline@v0.9.0 · 5566 in / 1018 out tokens · 98610 ms · 2026-05-16T06:08:22.712798+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  2. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  3. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  4. Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation

    cs.CR 2026-04 unverdicted novelty 7.0

    DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealt...

  5. Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...

  6. PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

    cs.LG 2026-04 unverdicted novelty 7.0

    PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.

  7. Large Language Models as Optimizers

    cs.LG 2023-09 unverdicted novelty 7.0

    Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...

  8. OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation

    cs.AI 2026-05 unverdicted novelty 6.0

    OpenDeepThink improves LLM reasoning by ranking parallel candidate traces via Bradley-Terry aggregation of LLM pairwise judgments, achieving a +405 Codeforces Elo gain on Gemini 3.1 Pro after eight rounds.

  9. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  10. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  11. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  12. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA uses a genetic algorithm to evolve agent seeds and achieves 74.52% human-exceeding performance on tabular AutoML tasks versus 54.15% for the AIDE baseline.

  13. AgentGA: Evolving Code Solutions in Agent-Seed Space

    cs.AI 2026-04 unverdicted novelty 6.0

    AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.

  14. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  15. Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...

  16. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  17. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  18. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  19. Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

    cs.CL 2026-05 unverdicted novelty 5.0

    Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.

  20. Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

    cs.SE 2026-04 accept novelty 4.0

    Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.

Reference graph

Works this paper leans on

153 extracted references · 153 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations

    Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4668--4679, 2020

  2. [2]

    Promptsource: An integrated development environment and repository for natural language prompts

    Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. Promptsource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System De...

  3. [3]

    Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems

    Janez Brest, Sašo Greiner, Borko Bošković, Marjan Mernik, and Viljem Žumer. Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. IEEE transactions on evolutionary computation, 10(6): 646--657, 2006

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901, 2020

  5. [6]

    Introduction to derivative-free optimization

    Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization. SIAM, 2009

  6. [7]

    Differential evolution: A survey of the state-of-the-art

    Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the state-of-the-art. IEEE transactions on evolutionary computation, 15(1): 4--31, 2010

  7. [8]

    Recent advances in differential evolution--an updated survey

    Swagatam Das, Sankha Subhra Mullick, and Ponnuthurai N Suganthan. Recent advances in differential evolution--an updated survey. Swarm and evolutionary computation, 27: 1--30, 2016

  8. [9]

    Rlprompt: Optimizing discrete text prompts with reinforcement learning

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 3369--3391, 2022

  9. [10]

    Ant colony system: a cooperative learning approach to the traveling salesman problem

    Marco Dorigo and Luca Maria Gambardella. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Transactions on evolutionary computation, 1(1): 53--66, 1997

  10. [14]

    John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975. ISBN 0262581116

  11. [15]

    Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence

    John H Holland. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992

  12. [16]

    Mining and summarizing customer reviews

    Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pp.\ 168--177, 2004

  13. [18]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 423--438, 2020

  14. [19]

    Particle swarm optimization

    James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of ICNN'95-international conference on neural networks, volume 4, pp.\ 1942--1948. IEEE, 1995

  15. [20]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 22199--22213, 2022

  16. [23]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, pp.\ 3045--3059, 2021

  17. [25]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.\ 4582--4597, 2021

  18. [26]

    Roulette-wheel selection via stochastic acceptance

    Adam Lipowski and Dorota Lipowska. Roulette-wheel selection via stochastic acceptance. Physica A: Statistical Mechanics and its Applications, 391(6): 2193--2196, 2012

  19. [27]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9): 1--35, 2023

  20. [31]

    Genetic algorithm: Theory, literature review, and application in image reconstruction

    Seyedali Mirjalili, Jin Song Dong, Ali Safa Sadiq, and Hossam Faris. Genetic algorithm: Theory, literature review, and application in image reconstruction. Nature-Inspired Optimizers: Theories, Literature Reviews and Applications, pp.\ 69--85, 2020

  21. [32]

    Reframing instructional prompts to gptk’s language

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to gptk’s language. In Findings of the Association for Computational Linguistics: ACL 2022, pp.\ 589--612, 2022 a

  22. [33]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3470--3487, 2022 b

  23. [34]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022 c

  24. [35]

    An introduction to genetic algorithms

    Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998

  25. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022

  26. [39]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Bo PANG. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005

  27. [40]

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts

    Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pp.\ 271--278, 2004

  28. [41]

    Differential evolution: A review of more than two decades of research

    Millie Pant, Hira Zaheer, Laura Garcia-Hernandez, Ajith Abraham, et al. Differential evolution: A review of more than two decades of research. Engineering Applications of Artificial Intelligence, 90: 103479, 2020

  29. [43]

    Differential evolution

    Kenneth V Price. Differential evolution. In Handbook of optimization: From classical to modern approach, pp.\ 187--214. Springer, 2013

  30. [46]

    Derivative-free optimization: a review of algorithms and comparison of software implementations

    Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56: 1247--1293, 2013

  31. [48]

    Exploiting cloze-questions for few-shot text classification and natural language inference

    Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 255--269, 2021

  32. [51]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 4222--4235, 2020

  33. [52]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp.\ 1631--1642, 2013

  34. [53]

    Differential evolution--a simple and efficient heuristic for global optimization over continuous spaces

    Rainer Storn and Kenneth Price. Differential evolution--a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11: 341--359, 1997

  35. [55]

    Stanford Alpaca: An Instruction-following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  36. [57]

    A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems

    Jakob Vesterstrom and Rene Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Proceedings of the 2004 congress on evolutionary computation (IEEE Cat. No. 04TH8753), volume 2, pp.\ 1980--1987. IEEE, 2004

  37. [58]

    Building a question answering test collection

    Ellen M Voorhees and Dawn M Tice. Building a question answering test collection. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp.\ 200--207, 2000

  38. [59]

    Universal adversarial triggers for attacking and analyzing nlp

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 2153--2162, 2019

  39. [61]

    Tournament selection --- Wikipedia , the free encyclopedia

    Wikipedia contributors . Tournament selection --- Wikipedia , the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Tournament_selection&oldid=1160627612, 2023. [Online; accessed 26-September-2023]

  40. [62]

    Optimizing statistical machine translation for text simplification

    Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4: 401--415, 2016

  41. [63]

    Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts

    JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why johnny can’t prompt: how non-ai experts try (and fail) to design llm prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp.\ 1--21, 2023

  42. [64]

    JADE: Adaptive differential evolution with optional external archive

    Jingqiao Zhang and Arthur C. Sanderson. JADE: Adaptive differential evolution with optional external archive. IEEE Transactions on Evolutionary Computation, 13(5): 945--958, 2009. doi:10.1109/TEVC.2009.2014613

  43. [65]

    Differentiable prompt makes pre-trained language models better few-shot learners

    Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. In International Conference on Learning Representations, 2021

  44. [67]

    Tempera: Test-time prompt editing via reinforcement learning

    Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez. Tempera: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023 a

  45. [69]

    Character-level convolutional networks for text classification

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. NeurIPS, 28, 2015

  46. [72]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022

  47. [74]

    Scaling Learning Algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  48. [75]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): 1527--1554, 2006

  49. [76]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  50. [77]

    Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education

    Clancey, William J. Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education. Proceedings of the Eighth International Joint Conference on Artificial Intelligence (IJCAI-83)

  51. [78]

    Classification Problem Solving

    Clancey, William J. Classification Problem Solving. Proceedings of the Fourth National Conference on Artificial Intelligence

  52. [79]

    New Ways to Make Microcircuits Smaller

    Arthur L. Robinson. New Ways to Make Microcircuits Smaller. Science, 1980. https://science.sciencemag.org/content/208/4447/1019.full.pdf

  53. [80]

    New Ways to Make Microcircuits Smaller---Duplicate Entry

    Robinson, Arthur L. New Ways to Make Microcircuits Smaller---Duplicate Entry. Science

  54. [81]

    Strategic explanations for a diagnostic consultation system

    Diane Warner Hasling, William J. Clancey, and Glenn Rennels. Strategic explanations for a diagnostic consultation system. The International Journal of Man-Machine Studies, 1984. doi:10.1016/S0020-7373(84)80003-6

  55. [82]

    Strategic Explanations in Consultation---Duplicate

    Diane Warner Hasling, William J. Clancey, Glenn R. Rennels, and Thomas Test. Strategic Explanations in Consultation---Duplicate. The International Journal of Man-Machine Studies

  56. [83]

    Poligon: A System for Parallel Problem Solving

    Rice, James. Poligon: A System for Parallel Problem Solving

  57. [84]

    Transfer of Rule-Based Expertise through a Tutorial Dialogue

    Clancey, William J. Transfer of Rule-Based Expertise through a Tutorial Dialogue

  58. [85]

    The Engineering of Qualitative Models

    Clancey, William J. The Engineering of Qualitative Models

  59. [86]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

  60. [87]

    Pluto: The 'Other' Red Planet

    NASA. Pluto: The 'Other' Red Planet

  61. [88]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 2022

  62. [89]

    OPT: Open Pre-trained Transformer Language Models

    OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  63. [90]

    Learning transferable visual models from natural language supervision

    Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

  64. [91]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023

  65. [92]

    Large Language Models are Human-Level Prompt Engineers

    Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations

  66. [93]

    Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts

    Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems

  67. [94]

    RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

    RLPrompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

  68. [95]

    TEMPERA: Test-Time Prompt Editing via Reinforcement Learning

    TEMPERA: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023

  69. [96]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Ves Stoyanov. OPT-IML: Scaling language model instruction meta learning through the lens of generalization. CoRR, arXiv:2212.12017, 2022

  70. [97]

    Transformers: State-of-the-Art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

  71. [98]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

  72. [99]

    A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004

  73. [100]

    Building a question answering test collection

    Building a question answering test collection. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2000

  74. [101]

    DBpedia--a large-scale, multilingual knowledge base extracted from Wikipedia

    DBpedia--a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015

  75. [102]

    Character-level convolutional networks for text classification

    Character-level convolutional networks for text classification. NeurIPS, 28, 2015

  76. [103]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013

  77. [104]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005

  78. [105]

    Mining and summarizing customer reviews

    Mining and summarizing customer reviews. In KDD, 2004

  79. [106]

    BBT v2: Towards a Gradient-Free Future with Large Language Models

    Sun, Tianxiang and He, Zhengfu and Qian, Hong and Zhou, Yunhua and Huang, Xuanjing and Qiu, Xipeng. BBT v2: Towards a Gradient-Free Future with Large Language Models. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022

  80. [107]

    Black-box tuning for language-model-as-a-service

    Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, 2022

Showing first 80 references.