arxiv: 2512.11013 · v2 · submitted 2025-12-11 · 💻 cs.CL

PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

Pawel Batorski, Paul Swoboda

Pith reviewed 2026-05-16 23:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic prompting · in-context learning · Shapley values · few-shot examples · prompt engineering · Monte Carlo estimation · text classification · mathematical reasoning

The pith

A Monte Carlo Shapley-based method iteratively refines few-shot examples to set new state-of-the-art results among automatic prompting techniques on classification, simplification, and GSM8K.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PIAST as a way to build strong prompts from human instructions by starting with a small set of generated few-shot examples and then repeatedly keeping, dropping, or replacing them. It measures each example's contribution through Monte Carlo Shapley estimates and speeds up the loop with subsampling and a replay buffer so the whole process fits modest compute budgets. On limited resources the approach beats prior automatic methods on simplification and GSM8K while placing second on classification and summarization; with a bit more compute it reaches the top on three of the four tasks. The work argues that selecting and refining the right examples matters more for performance than exhaustive searches over instructions.

Core claim

PIAST augments a human instruction with a small set of few-shot examples and refines that set through an iterative keep/drop/replace loop driven by Monte Carlo Shapley estimates of example utility, accelerated by aggressive subsampling and a replay buffer. When run under limited compute it outperforms existing automatic prompting baselines on text simplification and GSM8K and ranks second on classification and summarization. With an extended yet still modest budget it establishes new state-of-the-art scores among automatic methods on classification, simplification, and GSM8K. These results indicate that carefully constructed examples, rather than exhaustive instruction search, form the main lever for fast, data-efficient prompt engineering.

What carries the argument

Iterative keep/drop/replace of few-shot examples guided by Monte Carlo Shapley estimates of their utility.
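
The mechanism named above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain permutation sampling for the Shapley estimate, a toy additive `utility` in place of an LLM evaluation, and a hypothetical threshold rule (`keep_or_drop`) for the decision step. All function names, the threshold, and the toy scores are assumptions made for illustration; a replace step would additionally draw a new candidate for each dropped slot.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def mc_shapley(
    examples: List[str],
    utility: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {ex: 0.0 for ex in examples}
    for _ in range(num_permutations):
        order = list(examples)
        rng.shuffle(order)
        subset: FrozenSet[str] = frozenset()
        prev_score = utility(subset)
        for ex in order:
            subset = subset | {ex}
            score = utility(subset)
            totals[ex] += score - prev_score  # marginal gain of adding ex
            prev_score = score
    return {ex: total / num_permutations for ex, total in totals.items()}


def keep_or_drop(values: Dict[str, float], threshold: float = 0.0) -> Dict[str, str]:
    """Hypothetical rule: keep examples whose estimate exceeds the threshold."""
    return {ex: ("keep" if v > threshold else "drop") for ex, v in values.items()}


# Toy stand-in (assumption) for "score of the prompt built from this subset".
TOY_SCORES = {"ex_a": 0.3, "ex_b": 0.1, "ex_c": -0.2}


def toy_utility(subset: FrozenSet[str]) -> float:
    return sum(TOY_SCORES[ex] for ex in subset)


values = mc_shapley(list(TOY_SCORES), toy_utility)
decisions = keep_or_drop(values)
print(decisions)
```

Because the toy utility is additive, each estimate coincides with the example's own score. With a real LLM-based utility the marginal gains depend on which other examples are already in the prompt, which is the case the Shapley average is meant to handle.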

If this is right

With limited compute the method outperforms prior automatic prompting approaches on simplification and GSM8K and ranks second on classification and summarization.
With extended but still modest compute it reaches new state-of-the-art results among automatic methods on classification, simplification, and GSM8K.
Carefully constructed few-shot examples constitute the dominant lever for fast, data-efficient prompt engineering compared with exhaustive instruction search.
Aggressive subsampling and a replay buffer allow the utility-guided refinement loop to run efficiently under varying compute budgets.
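
The summary does not spell out how subsampling and the replay buffer interact. The sketch below shows one plausible reading, stated as an assumption: each evaluation scores an example set on a small random subset of the data, and a buffer keyed by (example set, data item) reuses earlier scores instead of calling the model again. The class name, `score_fn`, and the choice to ignore example order in the cache key are all hypothetical.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

ScoreFn = Callable[[Tuple[str, ...], str], float]


class CachedSubsampledEvaluator:
    """Score an example set on a random data subset, reusing buffered results."""

    def __init__(self, score_fn: ScoreFn, data: Sequence[str],
                 sample_size: int, seed: int = 0) -> None:
        if not data or sample_size < 1:
            raise ValueError("need at least one data item and sample_size >= 1")
        self.score_fn = score_fn
        self.data = list(data)
        self.sample_size = min(sample_size, len(self.data))
        self.rng = random.Random(seed)
        self.buffer: Dict[Tuple[Tuple[str, ...], str], float] = {}
        self.calls = 0  # number of real (uncached) score_fn calls

    def evaluate(self, examples: Sequence[str]) -> float:
        # Assumption: example order is ignored so reordered sets share a cache key.
        key_examples = tuple(sorted(examples))
        batch = self.rng.sample(self.data, self.sample_size)
        total = 0.0
        for item in batch:
            key = (key_examples, item)
            if key not in self.buffer:
                self.buffer[key] = self.score_fn(key_examples, item)
                self.calls += 1
            total += self.buffer[key]
        return total / len(batch)


# Toy scorer (assumption): pretends longer example sets score higher.
evaluator = CachedSubsampledEvaluator(
    lambda examples, item: float(len(examples)),
    data=["q1", "q2", "q3", "q4"],
    sample_size=2,
)
print(evaluator.evaluate(["ex_a", "ex_b"]), evaluator.calls)
```

Smaller `sample_size` values lower the cost per evaluation at the price of noisier scores, which is the trade-off behind the variance concerns raised in the referee report.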

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same utility estimation loop could be applied to other in-context learning tasks that currently rely on hand-picked examples.
If the Shapley estimates remain stable across different model sizes, the approach may reduce reliance on large held-out validation sets during prompt tuning.
Combining the example-refinement step with existing instruction-optimization techniques might yield further gains in low-data regimes.
The emphasis on example quality suggests that future automatic methods could focus more on generating candidate examples than on searching prompt wording.

Load-bearing premise

Monte Carlo Shapley estimates of example utility reliably identify which examples to keep, drop, or replace so the resulting prompts generalize better than baselines on held-out test data.

What would settle it

On a held-out test set the prompts produced by the Shapley-guided process achieve lower accuracy than prompts built from random or baseline example selection when both are given the same number of evaluations.
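
A budget-matched comparison of this kind needs both selection strategies to be charged for every utility call. One simple way to enforce that, sketched here with hypothetical names (`BudgetedUtility`, `BudgetExceeded`), is to wrap the utility function in a counter that refuses further calls once the budget is spent.

```python
from typing import Callable, FrozenSet


class BudgetExceeded(RuntimeError):
    """Raised when a strategy asks for more evaluations than allowed."""


class BudgetedUtility:
    """Wrap a utility function and count how many times it is called."""

    def __init__(self, utility: Callable[[FrozenSet[str]], float], budget: int) -> None:
        self.utility = utility
        self.budget = budget
        self.used = 0

    def __call__(self, subset: FrozenSet[str]) -> float:
        if self.used >= self.budget:
            raise BudgetExceeded(f"budget of {self.budget} evaluations exhausted")
        self.used += 1
        return self.utility(subset)


def toy_utility(subset: FrozenSet[str]) -> float:
    # Toy stand-in (assumption) for a prompt evaluation.
    return float(len(subset))


# Each strategy gets its own wrapper with an identical budget.
guided_eval = BudgetedUtility(toy_utility, budget=100)
random_eval = BudgetedUtility(toy_utility, budget=100)
print(guided_eval(frozenset({"ex_a"})), guided_eval.used)
```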

Figures

Figures reproduced from arXiv: 2512.11013 by Paul Swoboda, Pawel Batorski.

**Figure 1.** Overview of the results averaged over seven different text classification tasks, each run three times, comparing PIAST against current benchmarks. PIAST is able to generate high-quality prompts very efficiently, requiring only a small portion of the dataset while yielding results comparable to the current SOTA methods.

**Figure 2.** Pipeline of PIAST. Initially, the Example Proposer generates examples, which are then iteratively improved by evaluating them with the Prompt Evaluator and choosing new examples from the Example Improver to incorporate into the set of current in-context examples. In this section, we present our method, which is composed of three components: the Prompt Proposer, the Prompt Evaluator, and the Prompt Improver…

**Figure 3.** We observe a clear trend: increasing the number of crafting iterations consistently improves accuracy, albeit at the cost of higher runtime. This highlights an appealing property of PIAST: its performance can be effectively scaled by allocating more computation time by increasing the number of crafting iterations. Moreover, the plot clearly shows that PIAST has anytime performance superior to the baseline…

**Figure 3.** Scaling of PIAST on SUBJ compared to other baselines while increasing the number of improvement iterations. We run this ablation on all classification tasks using the same hyperparameters as PIAST and report results in …

Original abstract

LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. For faster execution, we use aggressive subsampling and a replay buffer for faster evaluations. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data efficient prompt engineering. Our code is available at https://github.com/Batorskq/PIAST.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PIAST's Monte Carlo Shapley loop for iterative few-shot example selection is a genuine algorithmic step forward that delivers reported gains on several tasks with modest compute, though the variance in those estimates remains an open question.

The core contribution here is a new way to build prompts by starting with a small set of generated few-shot examples and then iteratively keeping, dropping, or replacing them based on Monte Carlo Shapley estimates of their utility. They add aggressive subsampling and a replay buffer to keep runtime reasonable across different compute budgets. This combination does not reduce to the automatic prompting baselines they cite, and the public code makes it straightforward to test. The results show it beating prior methods on simplification and GSM8K under a tight budget, and claiming new state-of-the-art numbers among automatic methods on classification, simplification, and GSM8K once the budget is extended modestly. The takeaway that example selection matters more than exhaustive instruction search is stated plainly and matches what many practitioners observe in low-data settings.

Referee Report

3 major / 2 minor

Summary. The paper proposes PIAST, an automatic prompt construction algorithm that augments human instructions with a small set of few-shot examples selected via an iterative keep/drop/replace process driven by Monte Carlo Shapley estimates of example utility. Aggressive subsampling and a replay buffer are used for efficiency under varying compute budgets. Empirical results on classification, text simplification, summarization, and GSM8K claim outperformance over existing automatic prompting methods on limited budgets and new state-of-the-art results among such methods on classification, simplification, and GSM8K with an extended but modest budget. The work concludes that example construction dominates over exhaustive instruction search for data-efficient prompting.

Significance. If the reported gains prove robust under proper statistical controls, the method would provide a practical, fast approach to prompt engineering in scarce-data settings and reinforce the value of targeted example selection. Code release aids reproducibility. The significance is limited by incomplete experimental validation that leaves the reliability of the central empirical claims open to question.

major comments (3)

[Experiments] Experiments section (Tables 1–3): No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracies and improvements. Without these, the SOTA claims under the extended budget cannot be reliably assessed and the outperformance over baselines remains only partially supported.
[§3.2] §3.2 (Monte Carlo Shapley estimation): The core iterative selection relies on Monte Carlo Shapley values computed under aggressive subsampling and replay buffer. No analysis of estimate variance, stability across subsamples, or correlation with held-out utility is provided. This directly bears on whether the keep/drop/replace decisions generalize or are dominated by sampling noise.
[§4.1] §4.1 (Baseline comparisons): Exact reproduction details for baselines (e.g., APE, other automatic prompting methods) are not specified, including prompt formats, example counts, and hyperparameter settings. This is load-bearing for the comparative claims on classification, simplification, and GSM8K.
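
The stability analysis requested in the second major comment could take roughly the following form. This is a sketch under stated assumptions, not something the paper reports: it repeats a permutation-sampling Shapley estimate several times and records the mean and standard deviation of each example's estimate. The noisy toy utility stands in for a subsampled LLM evaluation, and every name and constant is hypothetical.

```python
import random
import statistics
from typing import Callable, Dict, FrozenSet, List, Tuple

Utility = Callable[[FrozenSet[str]], float]


def mc_shapley(examples: List[str], utility: Utility,
               num_permutations: int, rng: random.Random) -> Dict[str, float]:
    """Permutation-sampling Shapley estimate (same idea as in the sketch above)."""
    totals = {ex: 0.0 for ex in examples}
    for _ in range(num_permutations):
        order = list(examples)
        rng.shuffle(order)
        subset: FrozenSet[str] = frozenset()
        prev_score = utility(subset)
        for ex in order:
            subset = subset | {ex}
            score = utility(subset)
            totals[ex] += score - prev_score
            prev_score = score
    return {ex: total / num_permutations for ex, total in totals.items()}


def estimate_spread(examples: List[str], utility: Utility, num_permutations: int,
                    num_repeats: int, seed: int = 0) -> Dict[str, Tuple[float, float]]:
    """Return (mean, standard deviation) of each estimate across repeated runs."""
    if num_repeats < 2:
        raise ValueError("need at least two repeats to compute a spread")
    rng = random.Random(seed)
    runs = [mc_shapley(examples, utility, num_permutations, rng)
            for _ in range(num_repeats)]
    return {
        ex: (statistics.mean([run[ex] for run in runs]),
             statistics.stdev([run[ex] for run in runs]))
        for ex in examples
    }


def make_noisy_utility(scores: Dict[str, float], noise_std: float,
                       seed: int = 0) -> Utility:
    """Toy utility (assumption): additive scores plus Gaussian evaluation noise."""
    noise_rng = random.Random(seed)

    def utility(subset: FrozenSet[str]) -> float:
        return sum(scores[ex] for ex in subset) + noise_rng.gauss(0.0, noise_std)

    return utility


SCORES = {"ex_a": 0.3, "ex_b": 0.1, "ex_c": -0.2}
noisy = make_noisy_utility(SCORES, noise_std=0.05)
for num_permutations in (5, 50):
    spread = estimate_spread(list(SCORES), noisy, num_permutations, num_repeats=10)
    print(num_permutations, {ex: round(sd, 4) for ex, (_, sd) in spread.items()})
```

If the standard deviation of an estimate is comparable to the gap between two examples' means, a keep/drop decision between them is likely to be driven by sampling noise rather than utility.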

minor comments (2)

[Abstract] Abstract: The phrase 'modest compute budget' is imprecise; reporting concrete wall-clock time or token counts for the limited and extended settings would improve clarity.
[Figure 2] Figure 2: The pipeline diagram caption could explicitly label the replay buffer and subsampling steps to match the text description in §3.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental validation and reproducibility. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims.

Referee: [Experiments] Experiments section (Tables 1–3): No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracies and improvements. Without these, the SOTA claims under the extended budget cannot be reliably assessed and the outperformance over baselines remains only partially supported.

Authors: We agree that reporting the number of independent runs, variance across runs, and statistical significance is essential for robust evaluation of the SOTA claims. In the revised manuscript, we will rerun the key experiments with multiple random seeds (e.g., 5 runs), report means and standard deviations in Tables 1–3, and include paired t-tests or similar tests to assess the significance of improvements over baselines. This will directly address the reliability concerns. revision: yes
Referee: [§3.2] §3.2 (Monte Carlo Shapley estimation): The core iterative selection relies on Monte Carlo Shapley values computed under aggressive subsampling and replay buffer. No analysis of estimate variance, stability across subsamples, or correlation with held-out utility is provided. This directly bears on whether the keep/drop/replace decisions generalize or are dominated by sampling noise.

Authors: We acknowledge the value of analyzing the Monte Carlo Shapley estimates for variance and stability. In the revision, we will add a discussion and supporting figures in §3.2 (or an appendix) showing the variance of the estimates under different subsample sizes, their stability across multiple runs of the Monte Carlo procedure, and their correlation with held-out performance on a validation set. This will demonstrate that the keep/drop/replace decisions are driven by genuine utility signals rather than noise, while preserving the efficiency benefits of subsampling and the replay buffer. revision: yes
Referee: [§4.1] §4.1 (Baseline comparisons): Exact reproduction details for baselines (e.g., APE, other automatic prompting methods) are not specified, including prompt formats, example counts, and hyperparameter settings. This is load-bearing for the comparative claims on classification, simplification, and GSM8K.

Authors: We will update §4.1 with complete reproduction details for all baselines, explicitly stating the prompt formats, number of few-shot examples, hyperparameter values, and any other implementation specifics used for APE and the other automatic prompting methods. These details will also be included in the code release to ensure the comparative results on classification, simplification, and GSM8K can be exactly replicated. revision: yes
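
The paired test promised in the first response can be sketched with the standard library alone. The per-seed accuracies below are placeholders invented for illustration and do not come from the paper; `paired_t_statistic` is a hypothetical helper that computes only the test statistic, not a p-value.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(method_scores: Sequence[float],
                       baseline_scores: Sequence[float]) -> float:
    """Paired t statistic: mean per-seed difference over its standard error."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("both methods must be scored on the same seeds")
    if len(method_scores) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder per-seed accuracies, invented for illustration (not from the paper).
method = [0.80, 0.82, 0.81, 0.83, 0.84]
baseline = [0.78, 0.81, 0.80, 0.80, 0.82]
t_stat = paired_t_statistic(method, baseline)
print(f"t = {t_stat:.2f} with {len(method) - 1} degrees of freedom")
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel` returns the statistic together with a p-value.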

Circularity Check

0 steps flagged

No significant circularity in algorithmic prompt construction

The paper defines an iterative keep/drop/replace algorithm for few-shot examples driven by Monte Carlo Shapley utility estimates, with subsampling and replay buffer for speed. Performance claims rest on direct empirical comparisons to external baselines on held-out data for classification, simplification, and GSM8K. No equations reduce reported gains to fitted parameters or self-referential quantities by construction; no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The method is procedurally specified and externally benchmarked, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Shapley-value estimates obtained via Monte Carlo sampling accurately reflect the marginal utility of individual few-shot examples for downstream LLM performance, plus the practical choice of subsampling rate and replay-buffer size to control runtime.

free parameters (2)

compute budget
The method is explicitly designed to be run under different time budgets that trade off speed against final prompt quality.
subsampling aggressiveness
Aggressive subsampling is introduced to accelerate evaluations but is not given a fixed value in the abstract.

axioms (1)

domain assumption: Monte Carlo approximation of Shapley values provides a sufficiently accurate ranking of example utility to drive beneficial keep/drop/replace decisions.
This assumption underpins the entire iterative selection loop described in the abstract.
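
For reference, the textbook quantities behind this assumption can be written out as follows. The notation is chosen here and is not taken from the paper: $N$ is the set of $n$ candidate examples, $v(S)$ is the utility of a prompt built from subset $S$, $\pi_1, \dots, \pi_M$ are permutations of $N$ drawn uniformly at random, and $P_i^{\pi}$ is the set of examples that precede $i$ in $\pi$. The first line is the exact Shapley value; the second is the standard permutation-sampling estimator, which may differ in detail from the estimator the authors use.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]

\hat{\phi}_i = \frac{1}{M} \sum_{m=1}^{M}
  \bigl[\, v(P_i^{\pi_m} \cup \{i\}) - v(P_i^{\pi_m}) \,\bigr]
```

The estimator is unbiased for $\phi_i(v)$ when $v$ is evaluated exactly. With subsampled evaluations each $v(\cdot)$ is itself noisy, so the accuracy of the resulting ranking depends on both $M$ and the subsample size.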
