FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines

Aman Priyanshu; Amin Karbasi; Baturay Saglam; Blaine Nelson; Huaibo Zhao; Paul Kassianik; Supriti Vijay

arxiv: 2606.19605 · v2 · pith:ZARS46XXnew · submitted 2026-06-17 · 💻 cs.SE · cs.AI

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

Pith reviewed 2026-06-26 19:48 UTC · model grok-4.3

keywords LLM pipeline optimization · prompt engineering · automated diagnosis · multi-step reasoning · structural editing · benchmark evaluation · security task improvement

The pith

FAPO automates inspection and scoped edits of multi-step LLM pipelines to beat prompt-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FAPO as a system that runs an LLM inside a fixed codebase to evaluate a pipeline, examine its intermediate outputs, attribute failures to specific steps, and then apply either prompt rewrites or limited structural edits before re-testing against a target score. It shows this loop produces higher accuracy than a GEPA baseline across six benchmarks and three models, with the largest lifts occurring precisely on the cases where the system escalates beyond prompts to change pipeline structure. A sympathetic reader would care because multi-step LLM systems routinely fail at the interfaces between retrieval, reasoning, and formatting steps, and manual tuning of those interfaces does not scale. The work therefore claims that delegating diagnosis and repair to an automated loop can systematically locate and remove those interface failures.

Core claim

FAPO lets Claude Code repeatedly evaluate a pipeline, inspect its intermediate results, diagnose whether a failure is prompt-level or structural, propose a scoped edit, and validate the variant against the score function; it prefers prompt edits and escalates to structural changes only when prompt search appears insufficient and attribution shows a structural bottleneck. Across six benchmarks and three task models this procedure beats the GEPA baseline in 15 of 18 comparisons, with non-overlapping mean ± trial-standard-deviation ranges in 11 cases and a mean gain of 14.1 percentage points; on the six HoVer and IFBench comparisons that escalated to structural edits the mean gain rises to 33.8 points. The same procedure also raises accuracy on the s

What carries the argument

An iterative diagnosis-and-edit loop that first attempts prompt-level changes and escalates to permitted structural changes only after attributing failure to a chain-level bottleneck.
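The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `Variant`, `optimize`, the callables passed in, and the `patience` threshold for deciding that prompt search "appears insufficient" are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Variant:
    """A candidate pipeline: one prompt per step (hypothetical representation)."""
    prompts: List[str]
    structure: str = "baseline"  # hypothetical label for the chain layout


def optimize(
    start: Variant,
    score: Callable[[Variant], float],       # evaluation harness (assumed)
    diagnose: Callable[[Variant], str],      # returns "prompt" or "structural" (assumed)
    edit_prompt: Callable[[Variant], Variant],
    edit_structure: Callable[[Variant], Variant],
    target: float,
    max_iters: int = 10,
    patience: int = 2,                       # assumed stall threshold before escalating
) -> Tuple[Variant, float]:
    """Prompt-first search that escalates to structural edits only when needed."""
    best, best_score = start, score(start)
    stalled = 0  # consecutive candidates that failed to improve the score
    for _ in range(max_iters):
        if best_score >= target:
            break
        cause = diagnose(best)
        # Escalate only if prompt edits have stalled AND attribution is structural.
        if cause == "structural" and stalled >= patience:
            candidate = edit_structure(best)
        else:
            candidate = edit_prompt(best)
        candidate_score = score(candidate)
        # Validate: keep the candidate only if it beats the current best.
        if candidate_score > best_score:
            best, best_score, stalled = candidate, candidate_score, 0
        else:
            stalled += 1
    return best, best_score
```

In this sketch a rejected candidate is simply discarded, so a wrong diagnosis costs iterations rather than accuracy on the validation score; it says nothing about errors the score function cannot see, which is the premise the review questions.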

If this is right

On tasks whose bottlenecks are prompt-level, FAPO still improves accuracy but by smaller margins than on tasks that need structural repair.
Security-oriented pipelines such as CVE-to-CWE mapping become more accurate without manual prompt engineering.
The method is model-agnostic in the sense that the same optimization loop works for three different task models.
When structural changes are triggered, the performance delta over prompt-only search more than doubles.
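The reported means allow a quick back-of-the-envelope check on the last point. The sketch below assumes that the +14.1 pp figure is averaged over all 18 comparisons (the abstract does not state this explicitly) and derives the implied mean gain on the 12 comparisons that did not escalate; `residual_mean` is a helper name introduced here for illustration.

```python
def residual_mean(overall_mean: float, n_total: int,
                  subset_mean: float, n_subset: int) -> float:
    """Mean of the items outside a subset, given the overall and subset means."""
    n_rest = n_total - n_subset
    if n_rest <= 0:
        raise ValueError("subset must be strictly smaller than the total")
    # Total gain minus the subset's share, spread over the remaining items.
    return (overall_mean * n_total - subset_mean * n_subset) / n_rest


# Figures from the abstract: +14.1 pp over 18 comparisons (assumed),
# +33.8 pp over the six structural-escalation comparisons.
rest = residual_mean(14.1, 18, 33.8, 6)
print(round(rest, 2))  # 4.25
```

Under that assumption the twelve non-escalated comparisons average roughly +4.25 pp, so most of the headline mean would come from the six structural cases.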

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the diagnosis step is reliable, the same loop could be pointed at any codebase that exposes intermediate outputs and a scalar score, not just the six benchmarks tested.
The escalation rule implies that purely prompt-based optimizers will systematically under-perform on pipelines whose errors are architectural rather than linguistic.
A cheaper or open-source substitute for the diagnosis LLM would let the same workflow run at lower cost on the same benchmarks.

Load-bearing premise

The LLM performing the diagnosis can correctly identify whether a failure is caused by a prompt or by pipeline structure and can propose edits that raise the target score without creating new undetected errors.

What would settle it

A controlled run on one of the six benchmarks in which every edit proposed by the optimizer either leaves the score unchanged or lowers it while the reported accuracy still rises.

Original abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present Fully Automated Prompt Optimization (FAPO), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FAPO adds a conditional prompt-then-structure optimization loop inside a fixed codebase, but the gains rest on untested claims about the LLM's diagnosis accuracy.

FAPO lets an LLM inspect pipeline outputs, attribute failures to prompt or structure, and apply scoped edits, escalating to structural changes only when prompt search stalls. The abstract reports it beating GEPA in 15 of 18 model-benchmark pairs, with bigger margins on the six cases that reached structural edits.

The concrete combination of diagnosis, prompt-first search, and conditional structural edits in one standardized setup is the main new element. It targets a common pain point in chained LLM systems where isolated prompt tuning misses interactions between steps. The security-task results on CTIBench-RCM add a practical angle, even if those appear to be prompt-only runs.

The numbers are hard to weigh because the abstract supplies no trial counts, variance details beyond the reported standard deviations, score-function definitions, or controls for the optimizer LLM's own variability. More importantly, the method depends on Claude Code correctly diagnosing intermediate outputs and proposing edits that improve the target without new undetected errors. No audit, agreement check, or ablation tests that step, so it is unclear whether the reported edges come from the framework or from the base model's stochastic behavior.
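One way an agreement check on the diagnosis step could look is a chance-corrected agreement score between the optimizer's attributions and independent human labels. The sketch below computes Cohen's kappa from scratch; it is an illustration of the kind of audit the note asks for, not something the paper reports, and the label values are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical audit: optimizer attributions vs. human attributions per failure.
optimizer = ["prompt", "prompt", "structural", "structural"]
human = ["prompt", "structural", "structural", "structural"]
kappa = cohens_kappa(optimizer, human)
```

A kappa near 1 would indicate that the optimizer's prompt-versus-structure calls match human judgment well beyond chance; a value near 0 would suggest the escalation decisions are no better than guessing at the observed label rates.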

The security lifts are modest and do not exercise the escalation logic, which limits how much they support the full claim. The evaluation therefore leaves the central mechanism unverified.

Engineers who run multi-step LLM pipelines for applied work would find the most use here, especially if they already work inside similar codebases. The paper deserves a serious referee because the problem is real and the proposed loop is concrete, but any review would need to press on the missing validation of the diagnosis step and the statistical reporting before the results can be treated as reliable.

Referee Report

2 major / 1 minor

Summary. The paper introduces FAPO, a framework that uses Claude Code to automatically optimize multi-step LLM pipelines. It evaluates pipelines, inspects intermediate outputs, attributes failures to prompt or structural causes, proposes scoped edits (starting with prompts and escalating to structure only when needed), and iterates against a score function. Empirical claims state that across six benchmarks and three task models, FAPO outperforms the GEPA baseline in 15 of 18 comparisons, with mean gains of +14.1 pp overall and +33.8 pp in the six cases involving structural changes; additional gains are reported on security tasks such as CTIBench-RCM.

Significance. If the empirical results and the reliability of the automated diagnosis/editing loop hold after proper controls and verification, FAPO would represent a meaningful advance in automated optimization of multi-step LLM systems by addressing interactions across retrieval, reasoning, and formatting steps that prompt-only methods miss. The integration of failure attribution with scoped structural edits within a standardized codebase is a distinctive contribution relative to prior prompt optimization work.

major comments (2)

[Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.
[Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.
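The first comment turns on how the "non-overlapping mean ± trial-std" criterion behaves. A minimal sketch of that check follows, assuming the sample standard deviation (n - 1 denominator) over per-trial scores; the paper's exact convention is not given in the abstract, and the function names are introduced here for illustration.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def one_sd_range(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - sd, mean + sd) over per-trial scores (needs >= 2 trials)."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation; an assumption, see lead-in
    return (m - s, m + s)


def non_overlapping(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True when the two mean ± sd ranges share no point."""
    lo_a, hi_a = one_sd_range(scores_a)
    lo_b, hi_b = one_sd_range(scores_b)
    return hi_a < lo_b or hi_b < lo_a
```

With only a handful of trials per cell the standard deviation is itself noisy, and disjoint one-standard-deviation ranges are a descriptive criterion rather than a significance test, which is why the trial counts matter for interpreting the 11 reported cases.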

minor comments (1)

The manuscript should clarify whether the standardized codebase and any evaluation harness are released, as this directly affects reproducibility of the reported pipeline optimizations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's careful reading and constructive suggestions for improving the clarity and verifiability of our results. We address the major comments point by point below.

Referee: [Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.

Authors: We agree that the abstract would benefit from additional methodological details to allow readers to better assess the robustness of the claims. We will revise the abstract to include the number of trials per comparison, a brief definition of the score function, information on data splits, controls for the optimizer LLM's stochasticity, and any statistical tests. The full details are elaborated in the experimental sections of the manuscript.
revision: yes

Referee: [Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.

Authors: While the end-to-end performance gains provide indirect support for the effectiveness of the attribution and editing process, we acknowledge that direct verification through ablation or audit would strengthen the claims. We will add an ablation study isolating the impact of structural changes and include examples of the diagnosis and attribution process in the revised manuscript to address concerns about potential stochastic effects.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons only

The paper describes an empirical optimization framework (FAPO) that runs Claude Code to edit prompts and (when needed) structure, then reports measured accuracy gains versus the external baseline GEPA on six benchmarks. No equations, fitted parameters, first-principles derivations, or self-referential definitions appear in the provided text. Performance numbers are direct experimental outcomes, not quantities defined in terms of themselves or forced by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims therefore show none of the circularity patterns this check looks for.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; the central process rests on the unverified assumption that the optimizer LLM can perform reliable attribution and scoped editing.

axioms (1)

domain assumption: An LLM (Claude Code) can inspect intermediate outputs and correctly diagnose whether failures are prompt-level or structural.
The escalation logic and reported gains depend on this diagnostic capability.

invented entities (1)

FAPO framework (no independent evidence)
purpose: Automated end-to-end optimization of multi-step LLM pipelines
New named system introduced to perform the described loop.


[1] [1]

GEPA: Reflective prompt evolution can outperform reinforcement learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. InThe Fourteenth International...

2026

[2] [2]

CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence

Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, and Nidhi Rastogi. CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 50805–50825. Curran Associates, Inc., 2024. doi: 10.52202/...

work page doi:10.52202/079017-1607 2024

[3] [3]

Claude Code: An agentic coding tool, 2025.https://docs.anthropic.com/en/docs/claude-code

Anthropic. Claude Code: An agentic coding tool, 2025.https://docs.anthropic.com/en/docs/claude-code

2025

[4] [4]

AdaEvolve: Adaptive LLM-driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. AdaEvolve: Adaptive LLM-driven zeroth-order optimization.arXiv preprint arXiv:2602.20133, 2026

arXiv 2026

[5] [5]

Pappas, and Eric Wong

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023

Pith/arXiv arXiv 2023

[6] Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: MathArena as an evaluation platform for mathematics with LLMs, 2026. URL https://arxiv.org/abs/2605.00674

[7] Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. PromptBreeder: self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.

[8] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025. URL https://arxiv.org/abs/2503.19786

[9] Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ZG3RaNIsO8

[10] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. Best-of-N jailbreaking. In Advances in Neural Information Processing Systems, 2025. URL https://proceedings.neurips.cc/paper_files/paper/2025/hash/69f3eb242c7c9df9ea2f2b66ea8b3c0f-Abstract-Conference.html

[11] Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. HoVer: A dataset for many-hop fact extraction and claim verification. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3441–3460, Online, November 2020. Association for Computational Linguis...

[12] Andrej Karpathy. autoresearch: AI agents running research on single-GPU nanochat training automatically, 2026. https://github.com/karpathy/autoresearch

[13] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Represent... 2024.

[14] LangChain. LangGraph: Building stateful, multi-agent applications with LLMs, 2024. https://github.com/langchain-ai/langgraph

[15] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda R... 2023.

[16] Shu Liu, Shubham Agarwal, et al. EvoX: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026.

[17] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning R... 2024.

[18] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7Jwpw4qKkb

[19] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems, 2024.

[20] OpenAI. Introducing GPT-4.1 in the API, 2025. https://openai.com/index/gpt-4-1/

[21] OpenAI. Introducing GPT-5 for developers, 2025. https://openai.com/index/introducing-gpt-5-for-developers/

[22] Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 93... doi:10.18653/v1/2024.emnlp-main.525

[23] Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, and Jonas Geiping. Capability-based scaling trends for LLM-based red-teaming. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=1InFGGz1D5

[24] Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko. Claudini: Autoresearch discovers state-of-the-art adversarial attack algorithms for LLMs. arXiv preprint arXiv:2603.24511, 2026.

[25] Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum?id=yfYgwjj5F8

[26] Mahdi Sabbaghi, Paul Kassianik, George J. Pappas, Amin Karbasi, and Hamed Hassani. Adversarial reasoning at jailbreaking time. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=aWd7mL5U9Q

[27] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online, ... doi:10.18653/v1/2020.emnlp-main.346

[28] Li Siyan, Vethavikashini Chithrra Raghuram, Omar Khattab, Julia Hirschberg, and Zhou Yu. PAPILLON: Privacy preservation from Internet-based and local language model ensembles. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human... doi:10.18653/v1/2025.naacl-long.173

[29] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amand... 2023.

[30] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Proce... doi:10.18653/v1/d19-1221

[31] Sajana Weerawardhena, Paul Kassianik, Blaine Nelson, Baturay Saglam, Anu Vellore, Aman Priyanshu, Supriti Vijay, Massimo Aufiero, Arthur Goldblatt, Fraser Burch, Ed Li, Jianliang He, Dhruv Kedia, Kojin Oshiba, Zhuoran Yang, Yaron Singer, and Amin Karbasi. Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct technical report. arXiv preprint arXiv:2508.01059, 2025... doi:10.48550/arxiv.2508.01059

[32] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-limited LLM benchmark. In... 2025.

[33] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Bb4VGOWELI

[34] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processin... doi:10.18653/v1/d18-1259

[35] Zhuoran Yang, Ed Li, Jianliang He, Aman Priyanshu, Baturay Saglam, Paul Kassianik, Sajana Weerawardhena, Anu Vellore, Blaine Nelson, Neusha Javidnia, Arthur Goldblatt, Fraser Burch, Avi Zohary, Assaf Eisenman, Mahdi Sabbaghi, Supriti Vijay, Rahim Dharssi, Dhruv Kedia, Kojin Oshiba, Yaron Singer, and Amin Karbasi. Llama-3.1-FoundationAI-SecurityLLM-Reasoni... doi:10.48550/arxiv.2601.21051, 2026.

[36] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text. arXiv preprint arXiv:2406.07496, 2024.

[37] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=92gvk82DE-

[38] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

A System Implementation Details

This appendix gives the technical details that are summarized at a higher level in Section 2.

A.1 Runtime and Task Works...

1. `question` (str):
2. `summary_1` (str):
3. `summary_2` (str):
Your output fields are:
1. `reasoning` (str):
2. `answer` (str):
[...]
In adhering to this structure, your objective is: Given the fields `question`, `summary_1`, `summary_2`, produce the fields `answer`.

Optimized (variant-003, 70.3% val EM):

System: You answer multi-hop questions with the SHORTEST possible answer.
CRITICAL RULES:
1. MUST ALWAYS provide an answer. NEVER say "unknown", "none", "N/A", or "not enough information"
2. If summaries contain partial info, use what you have to make your best inference
3. If the question asks for a comparison and you only have data for one entity, answer with that entity.
ANSWER FORMAT RULES (follow EXACTLY):
- Output ONLY the entity name, number, date, or yes/no.
- NEVER output a full sentence as the answer.
- For yes/no questions: "yes" or "no" (lowercase).
- For "who ...