arxiv: 2606.11459 · v1 · pith:TKL6EN2Vnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI · cs.LG

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

Pith reviewed 2026-06-27 13:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords prompt optimization, evolutionary algorithms, data efficiency, large language models, automatic prompt engineering, dynamic data selection

The pith

APEX improves automatic prompt optimization by dynamically selecting which data to evaluate during the search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents APEX as a way to make evolutionary prompt search more data-efficient. Instead of evaluating every candidate prompt on the full fixed dataset, it tracks how the search has progressed and splits the data into easy, hard, and mixed tiers. The mixed tier supplies the most useful examples for creating new prompt variants and for telling which variants are actually better. Under a fixed budget of 5,000 evaluation calls, APEX improves on the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B across three benchmarks (IFBench, SimpleQA Verified, and FACTS Grounding).

Core claim

APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. Prioritizing the Mixed tier identifies the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality, which together yield higher performance than static data usage under the same evaluation budget.

What carries the argument

Dynamic stratification of the evaluation set into Easy, Hard, and Mixed tiers based on optimization lineage, used to select addressable and rank-sensitive data subsets.
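
As a rough illustration of the tiering idea, the sketch below assigns each datapoint to a tier from the pass/fail outcomes recorded for recent prompts in a lineage. The rule used here (all passes is Easy, all failures is Hard, anything else is Mixed), the `lookback` parameter, and every name in the code are assumptions made for illustration; this page does not give the paper's exact criteria.

```python
from typing import Dict, List


def stratify(history: Dict[str, List[bool]], lookback: int = 3) -> Dict[str, List[str]]:
    """Assign each datapoint to a tier from its recent pass/fail record.

    `history` maps a datapoint id to outcomes ordered oldest to newest,
    one entry per prompt in the lineage that was evaluated on it.
    The tiering rule is an assumed simplification, not the paper's.
    """
    tiers: Dict[str, List[str]] = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in history.items():
        window = outcomes[-lookback:]  # only the most recent evaluations count
        if not window:
            continue  # never evaluated, so there is no evidence to tier on
        if all(window):
            tiers["easy"].append(example_id)
        elif not any(window):
            tiers["hard"].append(example_id)
        else:
            tiers["mixed"].append(example_id)
    return tiers


# Hypothetical pass/fail records for four datapoints.
history = {
    "q1": [True, True, True],
    "q2": [False, False, False],
    "q3": [False, True, False],
    "q4": [False, True, True, True],  # early failure falls outside the window
}
tiers = stratify(history, lookback=3)
```

In this toy run `q3` is the only Mixed datapoint, so under the assumed rule it is the one a selector would prioritize.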

If this is right

  • Prompt search can be made substantially more efficient without changing the underlying mutation operators.
  • Performance gains appear on instruction-following, question-answering, and grounding tasks when the same budget is used.
  • The same data-tiering idea could be applied to other iterative optimization loops that consume evaluation calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the total number of model calls needed to reach a target prompt quality in production settings.
  • If the tiering logic proves stable across different base models, it could become a standard module in prompt-optimization libraries.

Load-bearing premise

That splitting data into easy, hard, and mixed tiers according to the optimization history reliably marks the subsets that best support mutation generation and candidate ranking.

What would settle it

Running the same evolutionary search with random data selection under the identical 5,000-call budget and obtaining equal or larger gains on the three benchmarks.
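
A control of this kind needs two pieces: a hard cap on evaluation calls that is shared by every method under comparison, and a selector that ignores optimization history. The sketch below shows one way to set that up. `BudgetedEvaluator`, `random_batch`, and `toy_score` are hypothetical names, and the scoring function is a placeholder for a real model call plus grader.

```python
import random
from typing import Callable, List, Sequence


class BudgetExceeded(Exception):
    """Raised when a search tries to spend more calls than it was given."""


class BudgetedEvaluator:
    """Wraps a scoring function and counts every call against a fixed budget."""

    def __init__(self, score_fn: Callable[[str, str], bool], budget: int) -> None:
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    @property
    def remaining(self) -> int:
        return self.budget - self.calls

    def evaluate(self, prompt: str, example: str) -> bool:
        if self.calls >= self.budget:
            raise BudgetExceeded(f"budget of {self.budget} calls exhausted")
        self.calls += 1
        return self.score_fn(prompt, example)


def random_batch(dataset: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """History-agnostic control: sample k datapoints uniformly without replacement."""
    return rng.sample(list(dataset), min(k, len(dataset)))


def toy_score(prompt: str, example: str) -> bool:
    # Placeholder for "run the model on the example and grade the output".
    return len(prompt) % 2 == len(example) % 2


evaluator = BudgetedEvaluator(toy_score, budget=5)
rng = random.Random(0)  # fixed seed so the control run is repeatable
batch = random_batch(["a", "bb", "ccc", "dddd"], k=3, rng=rng)
results = [evaluator.evaluate("prompt", ex) for ex in batch]
```

Routing both the tiered selector and the random selector through the same evaluator class, with the same `budget`, keeps the comparison call-for-call identical.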

Figures

Figures reproduced from arXiv: 2606.11459 by Cho-Jui Hsieh, Fei Wang, Inderjit S. Dhillon, Si Si.

Figure 1: The APEX framework overview. The optimization process begins by tracking prompt lineage (left), where historical performance across ancestors (e.g., p2) and children (e.g., p6) is analyzed over a defined lookback window. In the Data Dynamics phase (center), datapoints are dynamically categorized into tiers (Easy, Hard, and Mixed) based on pass/fail trajectories. The Mixed set identifies informative datapoi…
Figure 2: Comparison of test accuracy versus budget (i.e., number of evaluation calls) on IFBench with Gemini 2.5 Flash. The performance margin between APEX and baselines becomes larger as budget increases.
Figure 4: Qualitative analysis of APEX prompt evolution. The left panel tracks the optimization trajectory, highlighting when key instructional strategies were discovered. The right panel shows the final prompt, color-coded to match these milestones. Together, they demonstrate how APEX iteratively builds high-performance prompts by accumulating improvements over time.
Figure 5: Test accuracy versus token consumption on IFBench with Gemini 2.5 Flash.
Original abstract

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces APEX, a framework for automatic prompt optimization that dynamically stratifies the development dataset into Easy, Hard, and Mixed tiers based on optimization lineage. By prioritizing the Mixed tier (instances with mixed LLM performance), it targets two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. Under a fixed budget of 5,000 evaluation calls, APEX is reported to outperform the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B across IFBench, SimpleQA Verified, and FACTS Grounding.

Significance. If the lineage-based tiering is shown to isolate genuinely high-leverage subsets and the efficiency gains are reproducible, the work would meaningfully advance data-efficient prompt optimization, a practical bottleneck in evolutionary methods. The explicit focus on data usage alongside prompt search is a clear conceptual contribution, though the absence of validation for the core stratification mechanism limits current assessment of impact.

major comments (2)
  1. [Abstract / Methods (dynamic stratification)] The central efficiency claim (abstract) rests on the assertion that the Mixed tier 'identifies the data where the LLM has mixed performance' and thereby surfaces the addressable and rank-sensitive frontiers. No ablation, metric, or derivation is supplied showing that Mixed instances exhibit measurably higher mutation informativeness or rank sensitivity than random or static subsets; without this, performance gains under the 5,000-call budget cannot be confidently attributed to the claimed data-centric mechanism rather than implicit selection.
  2. [Abstract / Experimental results] The reported improvements (11.2% and 6.8%) are stated without accompanying experimental details, baseline definitions, error bars, or ablation results on the tiering component. This makes it impossible to verify whether the gains are load-bearing on the Mixed-tier prioritization or could arise from other factors in the evolutionary loop.
minor comments (1)
  1. [Abstract] The abstract supplies performance numbers but omits any description of the three benchmarks, the initial prompt, or how the 5,000-call budget is allocated across tiers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important gaps in validation and reporting that we will address through targeted revisions. We respond to each major comment below.

  1. Referee: [Abstract / Methods (dynamic stratification)] The central efficiency claim (abstract) rests on the assertion that the Mixed tier 'identifies the data where the LLM has mixed performance' and thereby surfaces the addressable and rank-sensitive frontiers. No ablation, metric, or derivation is supplied showing that Mixed instances exhibit measurably higher mutation informativeness or rank sensitivity than random or static subsets; without this, performance gains under the 5,000-call budget cannot be confidently attributed to the claimed data-centric mechanism rather than implicit selection.

    Authors: We agree that the current manuscript lacks direct empirical validation isolating the properties of the Mixed tier. The methods section defines the lineage-based stratification and the prioritization rule, but does not include quantitative comparisons of mutation success rates or ranking correlation on Mixed versus random/static subsets. In the revision we will add a dedicated ablation subsection that reports (i) the fraction of successful mutations generated from each tier and (ii) Spearman rank correlation between per-instance scores on the tier and final prompt ranking accuracy, all under controlled budgets. This will allow readers to assess whether the observed gains are attributable to the data-centric mechanism. revision: yes

  2. Referee: [Abstract / Experimental results] The reported improvements (11.2% and 6.8%) are stated without accompanying experimental details, baseline definitions, error bars, or ablation results on the tiering component. This makes it impossible to verify whether the gains are load-bearing on the Mixed-tier prioritization or could arise from other factors in the evolutionary loop.

    Authors: The abstract necessarily summarizes results at a high level; the full manuscript contains the experimental protocol, baseline prompt definitions, and per-benchmark tables. However, we acknowledge that error bars, multiple-run statistics, and an explicit ablation isolating the tiering component are not presented with sufficient prominence. The revised version will (a) report mean and standard deviation across three independent runs for all main results, (b) add a table that ablates the Mixed-tier prioritization against a random-subset baseline under the same 5,000-call budget, and (c) move key experimental details from the appendix into the main experimental section for easier verification. revision: yes
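
The statistics promised in these responses are simple to compute. The sketch below is one possible reading, not the authors' analysis code: it takes rank sensitivity to mean the Spearman correlation between candidate-prompt scores on a subset and on the full development set, and it reports mean and sample standard deviation over three runs. All numbers are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for four candidate prompts (illustrative only).
subset_scores = [0.2, 0.5, 0.4, 0.9]    # accuracy on a small selected subset
full_scores = [0.30, 0.55, 0.50, 0.80]  # accuracy on the full development set
rho = spearman(subset_scores, full_scores)

# Placeholder test accuracies from three independent runs (illustrative only).
run_accuracies = [0.61, 0.64, 0.62]
run_mean = mean(run_accuracies)
run_std = stdev(run_accuracies)  # sample standard deviation (n - 1)
```

A subset that ranks candidates in the same order as the full set gives a rho close to 1, which is the property a rank-sensitive subset would need to show against a random subset of the same size.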

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a methodological framework for dynamic dataset stratification into Easy/Hard/Mixed tiers based on optimization lineage, then claims empirical gains under a fixed 5,000-call budget. No equations, fitted parameters, or self-citations appear in the provided text. The central claims rest on experimental results rather than any derivation that reduces by construction to its own inputs or renames a fitted quantity as a prediction. The stratification is presented as an independent design choice whose validity is asserted via downstream performance, not via self-definition or load-bearing self-citation. This is the common case of a self-contained empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.1-grok · 5759 in / 1132 out tokens · 22511 ms · 2026-06-27T13:00:53.491802+00:00 · methodology

