PRL: Prompts from Reinforcement Learning
Pith reviewed 2026-05-22 14:20 UTC · model grok-4.3
The pith
Reinforcement learning generates novel few-shot examples to create more effective prompts for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRL is a reinforcement learning approach for automatic prompt generation that produces novel few-shot examples absent from the training data. The paper reports state-of-the-art results on text classification, simplification, and summarization. On classification, PRL beats APE by 2.58 percent and EvoPrompt by 1.00 percent. On summarization, it raises average ROUGE by 4.32 over APE and 2.12 over EvoPrompt. On simplification, it raises SARI by 6.93 over APE and 6.01 over EvoPrompt.
What carries the argument
A reinforcement learning process that discovers and outputs novel few-shot examples for use in prompts.
If this is right
- Surpasses APE and EvoPrompt on text classification accuracy
- Raises average ROUGE scores on summarization tasks
- Increases SARI scores on text simplification
- Creates prompts that include few-shot examples never seen during training
Where Pith is reading between the lines
- The same RL generation process could be tested on additional language tasks such as question answering or translation
- It may reduce the amount of human trial-and-error currently required to obtain reliable LLM outputs in production settings
- Future combinations of this method with human preference data could further refine the quality of automatically discovered prompts
Load-bearing premise
The premise is twofold: that the reinforcement learning process can discover and produce novel few-shot examples that were not present in the training data, and that these examples meaningfully improve downstream task performance beyond what standard prompt optimization achieves.
What would settle it
Training and evaluating PRL on the same benchmarks and finding that the generated prompts contain only training-set examples or produce no accuracy or score gains over APE and EvoPrompt would falsify the central claim.
Original abstract
Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses prior methods by 2.58% over APE and 1.00% over EvoPrompt. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PRL, a reinforcement learning approach for automatic prompt generation that claims to produce novel few-shot examples not present in the training data. It reports state-of-the-art results on text classification, simplification, and summarization, with specific gains such as 2.58% over APE and 1.00% over EvoPrompt on classification, plus ROUGE and SARI improvements on the other tasks. Code is released at https://github.com/Batorskq/prl.
Significance. If the novelty and causality claims are substantiated, PRL would offer a meaningful step in automated prompt optimization by showing RL can surface effective examples beyond standard search methods. The open-source code strengthens the contribution by enabling direct reproduction and extension.
major comments (2)
- [Abstract and §4 (Experiments)] The headline performance deltas (2.58% classification, 4.32 ROUGE summarization, 6.93 SARI simplification) are presented without ablations that hold the rest of the pipeline fixed while replacing RL-generated shots with randomly sampled training shots; without this control the gains cannot be attributed to novelty rather than search budget or reward shaping.
- [§3 (Method)] The central claim that PRL 'can produce novel few-shot examples that were not seen during training' lacks any verification step (deduplication pass, overlap metric, or example listing) to confirm the generated examples lie outside the training distribution; this verification is load-bearing for distinguishing PRL from prior prompt-optimization baselines.
minor comments (2)
- [Figures and Tables] Table captions and axis labels in the experimental figures should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
- [§3.2 (Reward Design)] The description of the RL reward function should include the precise scaling hyperparameters and any clipping applied, as these appear among the free parameters.
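The first minor comment turns on the difference between standard deviation and standard error. The sketch below shows how both could be computed from repeated runs so a caption can state which one the error bars show. The function name `summarize_runs` and the scores are hypothetical and are not taken from the paper.

```python
import math
import statistics


def summarize_runs(scores):
    """Return run count, mean, sample standard deviation, and standard error.

    `scores` holds one metric value per independent run (hypothetical data).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two runs are needed to estimate spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
    sem = std / math.sqrt(n)         # standard error of the mean
    return {"n": n, "mean": mean, "std": std, "sem": sem}


# Hypothetical accuracies from three seeds, for illustration only.
summary = summarize_runs([0.80, 0.82, 0.84])
print(f"n={summary['n']} mean={summary['mean']:.3f} "
      f"std={summary['std']:.3f} sem={summary['sem']:.3f}")
```

The standard error shrinks as the number of runs grows while the standard deviation does not, which is why the caption needs to say which quantity is plotted and how many runs were used.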
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important points for strengthening the attribution of gains to novelty and for verifying the core claim. We address each major comment below and outline the revisions we will make.
- Referee: [Abstract and §4 (Experiments)] The headline performance deltas (2.58% classification, 4.32 ROUGE summarization, 6.93 SARI simplification) are presented without ablations that hold the rest of the pipeline fixed while replacing RL-generated shots with randomly sampled training shots; without this control the gains cannot be attributed to novelty rather than search budget or reward shaping.
  Authors: We agree that the current experiments do not include a direct ablation isolating the effect of RL-generated novel shots against randomly sampled training shots under an otherwise identical pipeline. This control would strengthen the causal link between novelty and performance. In the revised manuscript we will add this ablation to §4, reporting results for random-shot baselines matched on search budget and reward model while keeping all other components fixed. We will also discuss how the comparison to APE and EvoPrompt already provides partial evidence, but the new control will address the referee's concern directly.
  Revision: yes
- Referee: [§3 (Method)] The central claim that PRL 'can produce novel few-shot examples that were not seen during training' lacks any verification step (deduplication pass, overlap metric, or example listing) to confirm the generated examples lie outside the training distribution; this verification is load-bearing for distinguishing PRL from prior prompt-optimization baselines.
  Authors: We acknowledge that an explicit verification step is necessary to substantiate the novelty claim. Although the RL objective and generation process are intended to produce unseen examples, we did not report quantitative checks in the original submission. In the revised version we will add to §3 (and the appendix) a deduplication analysis including exact string matching, n-gram overlap statistics, and embedding-based similarity thresholds between generated shots and the training set. We will also include representative example listings to illustrate the novel content.
  Revision: yes
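The deduplication analysis promised in the second response could start from something like the sketch below, which covers exact string matching and n-gram overlap only. It is an illustration under assumptions, not the authors' procedure: the names `normalize`, `ngrams`, and `novelty_report`, the trigram default, and the example sentences are all hypothetical, and the embedding-based similarity check is left out because it would depend on a model choice the text does not specify.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty_report(generated, training, n=3):
    """Flag exact matches and measure n-gram overlap against the training set.

    Overlap is the fraction of a generated example's n-grams that occur
    anywhere in the pooled training examples (an assumed definition).
    """
    train_norm = {normalize(t) for t in training}
    train_grams = set()
    for t in train_norm:
        train_grams |= ngrams(t.split(), n)

    report = []
    for g in generated:
        g_norm = normalize(g)
        grams = ngrams(g_norm.split(), n)
        overlap = len(grams & train_grams) / len(grams) if grams else 0.0
        report.append({
            "text": g,
            "exact_match": g_norm in train_norm,
            "ngram_overlap": overlap,
        })
    return report


# Hypothetical data, for illustration only.
training_set = ["The plot was slow and the acting was flat",
                "A warm and funny family film"]
generated_shots = ["the plot was slow   and the acting was flat",
                   "An inventive heist story with sharp dialogue"]
for row in novelty_report(generated_shots, training_set):
    print(row["exact_match"], round(row["ngram_overlap"], 2), row["text"])
```

An example that is an exact match or has overlap near 1.0 would count against the novelty claim, while low overlap alone does not prove the example lies outside the training distribution, which is why the authors also mention embedding-based thresholds.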
Circularity Check
No significant circularity detected in PRL derivation
The paper introduces PRL as an RL-based method for generating prompts and novel few-shot examples, reporting performance gains on classification, summarization, and simplification benchmarks. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations are present in the abstract or described method that reduce the central claims to the inputs by construction. The approach is presented as self-contained with public code, and performance deltas are framed as empirical outcomes rather than tautological re-statements of training data or prior self-citations. This is the expected honest non-finding for a method paper whose core contribution does not rely on the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward scaling and hyperparameters
axioms (1)
- domain assumption: A well-defined scalar reward signal exists that accurately measures prompt quality on the target tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our approach achieves state-of-the-art performance across a range of benchmarks... PRL can produce novel few-shot examples that were not seen during training.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Reward Function. Our reward combines a formatting reward... R = R_token + R_structure + R_format + R_alignment
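The quoted passage gives the reward only as an unweighted sum of four named terms, and the definitions of those terms are elided here. As a minimal sketch of that additive structure, assuming each component is already a scalar, the class `RewardParts` and the values below are hypothetical; any scaling or clipping the paper applies is not shown because it is not stated in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParts:
    """Four scalar components named after the terms in the quoted formula.

    How each component is computed is NOT given in the excerpt; the fields
    here are placeholders for whatever the paper defines.
    """
    token: float       # R_token
    structure: float   # R_structure
    fmt: float         # R_format
    alignment: float   # R_alignment

    def total(self) -> float:
        # R = R_token + R_structure + R_format + R_alignment (unweighted sum)
        return self.token + self.structure + self.fmt + self.alignment


# Hypothetical component values, for illustration only.
parts = RewardParts(token=0.5, structure=0.25, fmt=0.25, alignment=1.0)
print(parts.total())
```

Keeping the components separate before summing makes it easy to log each term per rollout, which is the kind of detail the referee's request for scaling hyperparameters and clipping would need.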
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
  PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...
- Prompt Optimization for LLM Code Generation via Reinforcement Learning
  A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
Reference graph
Works this paper leans on
- [1] Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 181 others. 2025. Deepseek-v3 technica...
- [2] Evaluating the instruction-following robustness of large language models to prompt injection. Preprint, arXiv:2308.10819. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe
- [3] Let’s verify step by step. arXiv preprint arXiv:2305.20050. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming f...
- [4] A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
  A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927. Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. Preprint, arXiv:2310...
- [5] Spurious Rewards: Rethinking Training Signals in RLVR
  Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, and Junxiao Song. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Kai Tang, Pengfei Hu, Zhe Zhao...
- [6] An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Medical Informatics, 12:e55318. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for sema...
- [7] Fine-tuning and prompt optimization: Two great steps that work better together. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10696–10710, Miami, Florida, USA. Association for Computational Linguistics. Ellen M Voorhees and Dawn M Tice. 2000. Building a question answering test collection. In Proceedings of...
- [8] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen
- [9] Large language models as optimizers. In The Twelfth International Conference on Learning Representations. Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, and Xin Wang. 2025. Llm-medqa: Enhancing medical question answering through case studies in large language models. Preprint, arXiv:2501.05464. Shunyu Yao, Dian...
- [10] OPT: Open Pre-trained Transformer Language Models
  Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822. Seungyoun Yi, Minsoo Khang, and Sungrae Park. 2025. Zera: Zero-init instruction evolving refinement agent – from zero instructions to structured prompts via principle-based optimization. In Proceedings of the 2025 Confer...
- [11] The movie was thrilling and exciting
  with ϵ = 0.2, β = 0.04, and weight decay 0.1. We additionally apply Low-Rank Adaptation (LoRA) (Hu et al., 2022) with learning rate 1×10^-6, α = 32, and rank r = 8.
  Sampling and Prompt Selection. During training, we sample n = 4 prompts per iteration and perform Prompt Selection every 100 iterations.
  Shared reward hyperparameters. Across all tasks we set r...
- [12] Shortening sentences
- [13] Using simpler vocabulary
- [14] Removing unnecessary or redundant words.
  Example of simplified text:
  Original: The quick brown fox jumped over the fence, which was quite a high jump for such a small creature.
  Simplified: The quick brown fox jumped over the high fence.
  Your task is to make the given text easier to understand while maintaining its original meaning.
  SARI: 57.19
  Manual Instr...
- [15] Recent work shows that aligned LLMs remain highly sensitive to small, semantics-preserving prompt changes:
  • (Sclar et al., 2024) find that subtle formatting choices (e.g., bullet style, whitespace, placement of labels) can change few-shot accuracy on LLaMA-2-13B by up to 76 accuracy points, and that this sensitivity persists even with larger mod...
- [16] show that, even for instruction-tuned chat models, the gap between the best and worst semantically equivalent prompts remains large, and that small rephrasings can significantly alter model behaviour.
  • (Errica et al., 2024) show that, even when average accuracy is high, minor changes
  Table 13: Accuracy achieved by prompts from RLPrompt and PRL on cl...
- [17] Small, human-plausible perturbations still hurt robustness:
  • (Zhu et al., 2024) demonstrate that minor typos, synonym substitutions, and light paraphrases, changes that preserve meaning for humans, can substantially degrade performance of LLMs.
  • (Li et al., 2023) show that short, injected instructions can override original goals in aligned instruc...
- [18] Classical in-context learning results already showed the importance of prompt choice:
  • (Zhao et al., 2021) and (Lu et al., 2022) show that the choice and ordering of few-shot examples can move GPT-style models from near-chance to near-state-of-the-art performance.
  • (Min et al., 2022) further show that how demonstrations are written often matters more t...
- [19] Even recent large RL-aligned reasoning models exhibit prompt sensitivity:
  • The DeepSeek-R1 paper (Guo et al., 2025) explicitly notes that a 671B-parameter RL-trained reasoning model remains sensitive to prompt design, and that seemingly reasonable prompting strategies (e.g., few-shot prompting) can decrease performance.
  • The GRACE framework (Shi et...
- [20] Our experiments give additional quantitative evidence on aligned models:
  • We evaluate PRL on instruction-tuned chat models (Qwen2.5-7B-Instruct, Qwen2-14B-Instruct, Qwen2-32B-Instruct, and LLaMA-3.1-8B-Instruct), i.e., models that already underwent instruction tuning/alignment.
  • Larger evaluation models (Qwen2-14B and Qwen2-32B) still benefit from PR...
- [21] Read the article thoroughly to understand its main subject matter
- [22] Determine which of the following categories the article’s main topic most closely aligns with:
  - ’World’: articles covering global news, politics, international affairs, etc.
  - ’Sports’: articles discussing various sports, competitions, athletes, etc.
  - ’Business’: articles focusing on financial news, corporate activities, markets, etc.
  - ’Tech’: articl...
- [23] If the article’s content is not clearly related to any of these categories, choose the closest option based on the predominant subject matter
- [24] Return the label of the chosen category as a single word without any additional text or explanations, e.g., ’World’, ’Sports’, ’Business’, or ’Tech’.
  Example:
  Article: "Apple Launches New iPhone Model with Improved Camera Features" Label: Tech
  Article: "China and the US Reach a New Trade Agreement" Label: World
  Article: "Local Soccer Team Qualifies fo...