
arxiv: 2505.14412 · v2 · submitted 2025-05-20 · 💻 cs.AI · cs.CL

PRL: Prompts from Reinforcement Learning

Pith reviewed 2026-05-22 14:20 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords prompt engineering · reinforcement learning · automatic prompt generation · few-shot examples · text classification · summarization · simplification

The pith

Reinforcement learning generates novel few-shot examples to create more effective prompts for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PRL, a method that uses reinforcement learning to automatically generate prompts for LLMs. It can invent new few-shot examples that never appeared in the training data, which prior prompt optimization techniques could not do. This produces stronger results on text classification, summarization, and simplification benchmarks compared with earlier approaches. A sympathetic reader would care because the method lowers the barrier of expert intuition needed to steer LLMs on routine language tasks.

Core claim

PRL is a reinforcement learning approach for automatic prompt generation that produces novel few-shot examples absent from the training data. It reports state-of-the-art results on three tasks: on text classification it exceeds APE by 2.58 percent and EvoPrompt by 1.00 percent; on summarization it raises average ROUGE by 4.32 over APE and 2.12 over EvoPrompt; on simplification it raises SARI by 6.93 over APE and 6.01 over EvoPrompt.

What carries the argument

Reinforcement learning process that discovers and outputs novel few-shot examples for use in prompts

If this is right

  • Surpasses APE and EvoPrompt on text classification accuracy
  • Raises average ROUGE scores on summarization tasks
  • Increases SARI scores on text simplification
  • Creates prompts that include few-shot examples never seen during training
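
For readers unfamiliar with the summarization metric, the following is a minimal sketch of a ROUGE-1 style F1 score based on clipped unigram overlap. It assumes lowercase whitespace tokenization and no stemming, and the function name `rouge1_f1` is hypothetical. The page does not say which ROUGE variants or which implementation the paper averages, so this illustrates the idea only and will not reproduce the reported numbers.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: simple lowercase whitespace tokenization, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A candidate that copies half of a four-word reference exactly has precision 1.0 and recall 0.5, which gives an F1 of about 0.67.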

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL generation process could be tested on additional language tasks such as question answering or translation
  • It may reduce the amount of human trial-and-error currently required to obtain reliable LLM outputs in production settings
  • Future combinations of this method with human preference data could further refine the quality of automatically discovered prompts

Load-bearing premise

The premise has two parts: that the reinforcement learning process can discover and produce novel few-shot examples that were not present in the training data, and that these examples meaningfully improve downstream task performance beyond what standard prompt optimization achieves.

What would settle it

Training and evaluating PRL on the same benchmarks and finding that the generated prompts contain only training-set examples or produce no accuracy or score gains over APE and EvoPrompt would falsify the central claim.

Figures

Figures reproduced from arXiv: 2505.14412 by Adrian Kosmala, Paul Swoboda, Paweł Batorski.

Figure 1: Left: Our RL-based prompt optimization cycle (overview). Right: Comparison of prompt-engineering methods. PRL automates both prompt generation and refinement, and synthesizes novel task-specific few-shot examples. The yellow tilde (∼) for APO and PromptAgent indicates limited few-shot support: its examples are drawn from training data, which restricts performance, whereas PRL creates new instances not seen…
Figure 2: Training scheme of PRL. First, the Prompt Generator…
Figure 4: Prompt used by PRL.
  PRL: In this task, you will classify the sentiment of movie review sentences as ‘positive’ or ‘negative’. Examples: “The movie was thrilling and exciting” → positive; “The plot was boring and predictable” → negative. Return only the label. Acc.: 96.38
  Manual Instruction: Please perform Sentiment Classification. Given the sentence, assign a label from [‘negative’, ‘positive’]. Return the…
Figure 5: Comparison of a manual instruction, the best…
Figure 6: Comparison of averaged ROUGE metrics based on prompts generated by PRL, EvoPrompt, and Manual Instruction for the summarization task. This figure highlights the importance of precise prompt design: although the two prompts generated by EvoPrompt on two different seeds are superficially similar, they result in significantly different performance. In contrast, the PRL prompt is both more effective and bette…
Figure 9: Best prompt generated by PRL for SST2 classification task along with accuracy.
  PRL: In this task, you are given sentences from movie reviews. Your goal is to classify each sentence as ’positive’ or ’negative’ based on its sentiment. Pay close attention to the context and nuances in the text, as the sentiment might not be explicitly stated. Examples: - "The acting was superb, and the plot was engaging." -> p…
Figure 10: Best prompt generated by PRL for CR clas…
Figure 11: Best prompt generated by PRL for MR clas…
Figure 8: Best prompt generated by PRL for SUBJ classification task along with accuracy.
  PRL: In this task, you will classify the sentiment of movie review sentences as ’positive’ or ’negative’. Examples: "The movie was thrilling and exciting" -> positive; "The plot was boring and predictable" -> negative. Return only the label: ’positive’ or ’negative’. Acc.: 96.38
Figure 13: Best prompt generated by PRL for AG’s News classification task along with accuracy.
  PRL: Please perform a Question Classification task. Given a question, classify it into one of the following categories: - **Description**: Questions asking for descriptions or explanations. - **Entity**: Questions asking about specific things, objects, or entities. - **Expression**: Questions asking about how something is e…
Figure 14: Best prompt generated by PRL for TREC classification task along with accuracy.
Appendix J, Usage of LLMs: LLMs were used for editorial support, including polishing the manuscript’s writing and presentation, and for help drafting some implementation code. The authors made all substantive research decisions and contributed all core technical ideas.
Original abstract

Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PRL, a reinforcement learning approach for automatic prompt generation that claims to produce novel few-shot examples not present in the training data. It reports state-of-the-art results on text classification, simplification, and summarization, with specific gains such as 2.58% over APE and 1.00% over EvoPrompt on classification, plus ROUGE and SARI improvements on the other tasks. Code is released at https://github.com/Batorskq/prl.

Significance. If the novelty and causality claims are substantiated, PRL would offer a meaningful step in automated prompt optimization by showing RL can surface effective examples beyond standard search methods. The open-source code strengthens the contribution by enabling direct reproduction and extension.

major comments (2)
  1. [Abstract and §4 (Experiments)] The headline performance deltas (2.58% classification, 4.32 ROUGE summarization, 6.93 SARI simplification) are presented without ablations that hold the rest of the pipeline fixed while replacing RL-generated shots with randomly sampled training shots. Without this control, the gains cannot be attributed to novelty rather than to search budget or reward shaping.
  2. [§3 (Method)] The central claim that PRL 'can produce novel few-shot examples that were not seen during training' lacks any verification step (deduplication pass, overlap metric, or example listing) to confirm that the generated examples lie outside the training distribution. This verification is load-bearing for distinguishing PRL from prior prompt-optimization baselines.
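
The control requested in the first major comment could be set up roughly as follows. This is a minimal sketch, not the authors' pipeline: it keeps the instruction text fixed and swaps in few-shot examples drawn uniformly at random from the training set. The helper names `build_prompt` and `random_shot_prompt`, the `"text" -> label` layout, and the seed handling are all assumptions made for illustration.

```python
import random


def build_prompt(instruction, shots):
    """Join an instruction and (text, label) pairs into one prompt string."""
    lines = [instruction, "Examples:"]
    for text, label in shots:
        lines.append(f'"{text}" -> {label}')
    return "\n".join(lines)


def random_shot_prompt(instruction, train_set, k, seed):
    """Baseline prompt: same instruction, k shots sampled from the training set.

    A fixed seed makes the sampled shots reproducible across runs.
    """
    rng = random.Random(seed)
    shots = rng.sample(train_set, k)  # sampling without replacement
    return build_prompt(instruction, shots)
```

Evaluating this baseline under the same search budget as the RL-generated prompt would help separate the effect of novel examples from the effect of the rest of the pipeline.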
minor comments (2)
  1. [Figures and Tables] Table captions and axis labels in the experimental figures should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
  2. [§3.2 (Reward Design)] The description of the RL reward function should include the precise scaling hyperparameters and any clipping applied, as these appear among the free parameters.
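
The distinction raised in the first minor comment matters because the two quantities differ by a factor of the square root of the number of runs. The sketch below shows both; the function name `summarize_runs` is hypothetical, and any scores passed to it are the caller's own, not values taken from the paper.

```python
import math
import statistics


def summarize_runs(scores):
    """Return mean, sample standard deviation, and standard error over runs."""
    n = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    sem = std / math.sqrt(n)        # standard error of the mean
    return {"n": n, "mean": mean, "std": std, "sem": sem}
```

With three runs, the standard error is about 0.58 times the standard deviation, so an unlabeled error bar can overstate or understate the spread considerably.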

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important points for strengthening the attribution of gains to novelty and for verifying the core claim. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The headline performance deltas (2.58% classification, 4.32 ROUGE summarization, 6.93 SARI simplification) are presented without ablations that hold the rest of the pipeline fixed while replacing RL-generated shots with randomly sampled training shots. Without this control, the gains cannot be attributed to novelty rather than to search budget or reward shaping.

    Authors: We agree that the current experiments do not include a direct ablation isolating the effect of RL-generated novel shots against randomly sampled training shots under an otherwise identical pipeline. This control would strengthen the causal link between novelty and performance. In the revised manuscript we will add this ablation to §4, reporting results for random-shot baselines matched on search budget and reward model while keeping all other components fixed. We will also discuss how the comparison to APE and EvoPrompt already provides partial evidence, but the new control will address the referee's concern directly. revision: yes

  2. Referee: [§3 (Method)] The central claim that PRL 'can produce novel few-shot examples that were not seen during training' lacks any verification step (deduplication pass, overlap metric, or example listing) to confirm that the generated examples lie outside the training distribution. This verification is load-bearing for distinguishing PRL from prior prompt-optimization baselines.

    Authors: We acknowledge that an explicit verification step is necessary to substantiate the novelty claim. Although the RL objective and generation process are intended to produce unseen examples, we did not report quantitative checks in the original submission. In the revised version we will add to §3 (and the appendix) a deduplication analysis including exact string matching, n-gram overlap statistics, and embedding-based similarity thresholds between generated shots and the training set. We will also include representative example listings to illustrate the novel content. revision: yes
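
Two of the checks named in the second response, exact string matching and n-gram overlap, can be sketched in a few lines. This is an illustrative version under stated assumptions: lowercase whitespace tokenization, trigram overlap by default, and hypothetical helper names (`normalize`, `ngrams`, `novelty_report`). The embedding-based similarity check is left out because it depends on a model choice the page does not specify.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def ngrams(text, n):
    """Set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_report(generated, train_texts, n=3):
    """Flag exact matches and n-gram overlap of generated shots vs. training data."""
    train_norm = {normalize(t) for t in train_texts}
    train_ngrams = set()
    for t in train_texts:
        train_ngrams |= ngrams(t, n)
    report = []
    for g in generated:
        g_ngrams = ngrams(g, n)
        shared = len(g_ngrams & train_ngrams)
        report.append({
            "text": g,
            "exact_match": normalize(g) in train_norm,
            # Fraction of the shot's n-grams that also occur in the training set.
            "ngram_overlap": shared / len(g_ngrams) if g_ngrams else 0.0,
        })
    return report
```

A generated shot with `exact_match` false and low `ngram_overlap` is a candidate for being novel in the surface-form sense; whether it lies outside the training distribution is a stronger claim that this check alone does not establish.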

Circularity Check

0 steps flagged

No significant circularity detected in PRL derivation

Full rationale

The paper introduces PRL as an RL-based method for generating prompts and novel few-shot examples, reporting performance gains on classification, summarization, and simplification benchmarks. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations are present in the abstract or described method that reduce the central claims to the inputs by construction. The approach is presented as self-contained with public code, and performance deltas are framed as empirical outcomes rather than tautological re-statements of training data or prior self-citations. This is the expected honest non-finding for a method paper whose core contribution does not rely on the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method implicitly relies on standard RL assumptions and benchmark-specific reward definitions.

free parameters (1)
  • RL reward scaling and hyperparameters
    Reinforcement learning for prompt generation typically requires tuned parameters such as learning rate, discount factor, and reward weighting that are fitted or chosen to achieve reported performance.
axioms (1)
  • domain assumption: A well-defined scalar reward signal exists that accurately measures prompt quality on the target tasks.
    The RL loop depends on this signal being a faithful proxy for downstream performance.
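
To make the free parameter concrete, here is a minimal sketch of what reward scaling and clipping typically look like in an RL loop. Everything in it is hypothetical: the function name `shaped_reward`, the default `scale`, and the clipping bounds are placeholders, since the page does not give the paper's actual reward definition or values.

```python
def shaped_reward(task_score, scale=1.0, clip_min=-1.0, clip_max=1.0):
    """Scale a raw task score, then clip it to [clip_min, clip_max].

    All default values are placeholders, not the paper's settings.
    """
    if clip_min > clip_max:
        raise ValueError("clip_min must not exceed clip_max")
    return max(clip_min, min(clip_max, scale * task_score))
```

Both `scale` and the clipping bounds change the size of the policy update for a given task score, which is why the referee report asks for them to be stated explicitly.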

pith-pipeline@v0.9.0 · 5733 in / 1250 out tokens · 45345 ms · 2026-05-22T14:20:18.128018+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

    cs.CL 2025-12 conditional novelty 7.0

    PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...

  2. Prompt Optimization for LLM Code Generation via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 5.0

    A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
