
arxiv: 2606.19605 · v2 · pith:ZARS46XXnew · submitted 2026-06-17 · 💻 cs.SE · cs.AI

FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines

Pith reviewed 2026-06-26 19:48 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM pipeline optimization · prompt engineering · automated diagnosis · multi-step reasoning · structural editing · benchmark evaluation · security task improvement

The pith

FAPO automates inspection and scoped edits of multi-step LLM pipelines to beat prompt-only baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FAPO as a system that runs an LLM inside a fixed codebase to evaluate a pipeline, examine its intermediate outputs, attribute failures to specific steps, and then apply either prompt rewrites or limited structural edits before re-testing against a target score. It shows this loop produces higher accuracy than a GEPA baseline across six benchmarks and three models, with the largest lifts occurring precisely on the cases where the system escalates beyond prompts to change pipeline structure. A sympathetic reader would care because multi-step LLM systems routinely fail at the interfaces between retrieval, reasoning, and formatting steps, and manual tuning of those interfaces does not scale. The work therefore claims that delegating diagnosis and repair to an automated loop can systematically locate and remove those interface failures.

Core claim

FAPO lets Claude Code repeatedly evaluate a pipeline, inspect its intermediate results, diagnose whether a failure is prompt-level or structural, propose a scoped edit, and validate the variant against the score function. It prefers prompt edits and escalates to structural changes only when attribution shows a structural bottleneck. Across six benchmarks and three task models this procedure beats the GEPA baseline in 15 of 18 comparisons, with non-overlapping mean ± trial-standard-deviation ranges in 11 of them and a mean gain of 14.1 percentage points; on the six HoVer and IFBench comparisons that required structural edits the mean gain rises to 33.8 points. The same procedure also raises accuracy on the s

What carries the argument

An iterative diagnosis-and-edit loop that first attempts prompt-level changes and escalates to permitted structural changes only after attributing failure to a chain-level bottleneck.
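The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Diagnosis` type, the callback names, the dict-based pipeline, and the `patience` rule (used here as a stand-in for "prompt optimization appears insufficient") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Diagnosis:
    step: str         # pipeline step blamed for the failures (hypothetical field)
    structural: bool  # True if the bottleneck is attributed to chain structure


def optimize(pipeline: dict,
             evaluate: Callable[[dict], Tuple[int, dict]],
             diagnose: Callable[[dict], Diagnosis],
             edit_prompt: Callable[[dict, Diagnosis], dict],
             edit_structure: Callable[[dict, Diagnosis], dict],
             target: int, max_iters: int = 10, patience: int = 2):
    """Prompt-first search that escalates to structural edits only when
    (a) attribution blames structure and (b) prompt edits have stalled.
    The stall rule is an assumption of this sketch."""
    best = pipeline
    best_score, trace = evaluate(best)
    stalls = 0
    for _ in range(max_iters):
        if best_score >= target:
            break
        diagnosis = diagnose(trace)
        escalate = diagnosis.structural and stalls >= patience
        editor = edit_structure if escalate else edit_prompt
        candidate = editor(best, diagnosis)
        score, candidate_trace = evaluate(candidate)
        if score > best_score:  # keep only variants that validate better
            best, best_score, trace = candidate, score, candidate_trace
            stalls = 0
        else:
            stalls += 1
    return best, best_score


# Toy pipeline: prompt edits help once, then only a structural fix helps.
def evaluate(p):
    score = 40 + 10 * min(p["prompt_quality"], 1) + 30 * int(p["has_verifier"])
    return score, dict(p)  # the "trace" here is just a snapshot of the pipeline


def diagnose(trace):
    structural = trace["prompt_quality"] >= 1 and not trace["has_verifier"]
    return Diagnosis(step="verify" if structural else "answer", structural=structural)


def edit_prompt(p, diagnosis):
    return {**p, "prompt_quality": p["prompt_quality"] + 1}


def edit_structure(p, diagnosis):
    return {**p, "has_verifier": True}


best, score = optimize({"prompt_quality": 0, "has_verifier": False},
                       evaluate, diagnose, edit_prompt, edit_structure, target=75)
```

In the toy run the first prompt edit is accepted, two further prompt edits fail to improve the score, and only then is the structural edit tried and accepted. The point of the design is that structural changes are gated behind both the diagnosis and evidence that prompt edits have stopped paying off.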

If this is right

  • On tasks whose bottlenecks are prompt-level, FAPO still improves accuracy but by smaller margins than on tasks that need structural repair.
  • Security-oriented pipelines such as CVE-to-CWE mapping become more accurate without manual prompt engineering.
  • The method is model-agnostic in the sense that the same optimization loop works for three different task models.
  • When structural changes are triggered, the mean gain over GEPA more than doubles, from +14.1 pp across all comparisons to +33.8 pp on the six structural-escalation comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the diagnosis step is reliable, the same loop could be pointed at any codebase that exposes intermediate outputs and a scalar score, not just the six benchmarks tested.
  • The escalation rule implies that purely prompt-based optimizers will systematically under-perform on pipelines whose errors are architectural rather than linguistic.
  • A cheaper or open-source substitute for the diagnosis LLM would let the same workflow run at lower cost on the same benchmarks.

Load-bearing premise

The LLM performing the diagnosis can correctly identify whether a failure is caused by a prompt or by pipeline structure and can propose edits that raise the target score without creating new undetected errors.

What would settle it

A controlled run on one of the six benchmarks in which every edit proposed by the optimizer either leaves the score unchanged or lowers it while the reported accuracy still rises.
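The check described here can be made concrete as an audit over a log of proposed edits. The sketch below assumes a hypothetical log format (`EditRecord` with a validation score before and after each edit); the paper's actual logging is not described on this page.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EditRecord:
    step: str            # which pipeline step the edit touched (hypothetical)
    score_before: float  # validation score before applying the edit
    score_after: float   # validation score after applying the edit


def gain_unexplained_by_edits(edits: Iterable[EditRecord],
                              reported_baseline: float,
                              reported_final: float) -> bool:
    """True when no individual edit raised the score, yet the reported
    accuracy still went up: the situation that would undercut the loop."""
    no_edit_helped = all(e.score_after <= e.score_before for e in edits)
    return no_edit_helped and reported_final > reported_baseline


# Illustrative numbers only, not results from the paper.
suspicious = gain_unexplained_by_edits(
    [EditRecord("retrieve", 0.50, 0.50), EditRecord("answer", 0.50, 0.48)],
    reported_baseline=0.50, reported_final=0.62)
healthy = gain_unexplained_by_edits(
    [EditRecord("retrieve", 0.50, 0.58), EditRecord("answer", 0.58, 0.62)],
    reported_baseline=0.50, reported_final=0.62)
```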

Original abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present Fully Automated Prompt Optimization (FAPO), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
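The "non-overlapping mean ± trial-standard-deviation ranges" criterion can be read as a simple interval comparison. The sketch below is one plausible reading, assuming the sample standard deviation over trials; the abstract does not say whether sample or population deviation is used, nor how many trials were run. The scores are made up for illustration.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_std_range(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean - std, mean + std) using the sample standard deviation."""
    m, s = mean(scores), stdev(scores)
    return m - s, m + s


def wins_without_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when a's whole range sits strictly above b's whole range."""
    a_low, _ = mean_std_range(a)
    _, b_high = mean_std_range(b)
    return a_low > b_high


# Illustrative trial scores (percent accuracy), not taken from the paper.
clear_win = wins_without_overlap([70, 72, 74], [55, 57, 59])      # ranges 70..74 vs 55..59
overlapping = wins_without_overlap([60, 70, 80], [55, 60, 65])    # ranges 60..80 vs 55..65
```

A criterion of this kind is a descriptive separation check rather than a significance test, which is why the number of trials behind each range matters for interpreting it.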

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FAPO, a framework that uses Claude Code to automatically optimize multi-step LLM pipelines. It evaluates pipelines, inspects intermediate outputs, attributes failures to prompt or structural causes, proposes scoped edits (starting with prompts and escalating to structure only when needed), and iterates against a score function. Empirical claims state that across six benchmarks and three task models, FAPO outperforms the GEPA baseline in 15 of 18 comparisons, with mean gains of +14.1 pp overall and +33.8 pp in the six cases involving structural changes; additional gains are reported on security tasks such as CTIBench-RCM.

Significance. If the empirical results and the reliability of the automated diagnosis/editing loop hold after proper controls and verification, FAPO would represent a meaningful advance in automated optimization of multi-step LLM systems by addressing interactions across retrieval, reasoning, and formatting steps that prompt-only methods miss. The integration of failure attribution with scoped structural edits within a standardized codebase is a distinctive contribution relative to prior prompt optimization work.

major comments (2)
  1. [Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.
  2. [Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.
minor comments (1)
  1. The manuscript should clarify whether the standardized codebase and any evaluation harness are released, as this directly affects reproducibility of the reported pipeline optimizations.
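As an illustration of the kind of statistical test major comment 1 asks for, a two-sided sign test on the 15-of-18 win count can be computed directly. This is a sketch under strong assumptions that the abstract does not confirm: the 18 comparisons are treated as independent (they share benchmarks across models, so they probably are not), and the three non-wins are treated as losses rather than ties.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """Two-sided exact sign test against a 50/50 null hypothesis."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 15 wins out of 18 comparisons, as reported in the abstract.
p = sign_test_p_value(15, 18)  # roughly 0.0075
```

A test of this form addresses only the win count; it says nothing about optimizer stochasticity or the size of the gains, which would need per-trial scores.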

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's careful reading and constructive suggestions for improving the clarity and verifiability of our results. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.

    Authors: We agree that the abstract would benefit from additional methodological details to allow readers to better assess the robustness of the claims. We will revise the abstract to include the number of trials per comparison, a brief definition of the score function, information on data splits, controls for the optimizer LLM's stochasticity, and any statistical tests. The full details are elaborated in the experimental sections of the manuscript. revision: yes

  2. Referee: [Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.

    Authors: While the end-to-end performance gains provide indirect support for the effectiveness of the attribution and editing process, we acknowledge that direct verification through ablation or audit would strengthen the claims. We will add an ablation study isolating the impact of structural changes and include examples of the diagnosis and attribution process in the revised manuscript to address concerns about potential stochastic effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons only

Full rationale

The paper describes an empirical optimization framework (FAPO) that runs Claude Code to edit prompts and (when needed) structure, then reports measured accuracy gains versus the external baseline GEPA on six benchmarks. No equations, fitted parameters, first-principles derivations, or self-referential definitions appear in the provided text. Performance numbers are direct experimental outcomes, not quantities defined in terms of themselves or forced by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims therefore remain independent of the circularity patterns enumerated in the instructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; the central process rests on the unverified assumption that the optimizer LLM can perform reliable attribution and scoped editing.

axioms (1)
  • Domain assumption: an LLM (Claude Code) can inspect intermediate outputs and correctly diagnose whether failures are prompt-level or structural.
    The escalation logic and reported gains depend on this diagnostic capability.
invented entities (1)
  • FAPO framework (no independent evidence)
    purpose: Automated end-to-end optimization of multi-step LLM pipelines
    New named system introduced to perform the described loop.

pith-pipeline@v0.9.1-grok · 5831 in / 1430 out tokens · 33209 ms · 2026-06-26T19:48:42.977141+00:00 · methodology

