pith. machine review for the scientific record.

arxiv: 2507.19457 · v2 · submitted 2025-07-25 · 💻 cs.CL · cs.AI · cs.LG · cs.SE

Recognition: 3 theorem links · Lean Theorem

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 07:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.SE
keywords prompt optimization · natural language reflection · reinforcement learning · LLM adaptation · evolutionary search · Pareto optimization

The pith

Natural language reflection on a few trajectories lets prompt evolution outperform RL with up to 35 times fewer rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that language offers LLMs a richer learning signal than the sparse scalar rewards used in reinforcement learning methods like GRPO. It introduces GEPA, a prompt optimizer that samples trajectories from any AI system, reflects on them in natural language to spot problems, proposes prompt updates, and merges lessons from its strongest variants on the Pareto frontier. This process can turn small numbers of attempts into large performance gains. The paper reports that GEPA beats GRPO by 6 percent on average across six tasks (and by up to 20 percent) while using up to 35 times fewer rollouts, and that it also exceeds leading prompt optimizers such as MIPROv2.
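
Read literally, the loop the abstract describes is simple enough to sketch. The Python below is an editorial illustration of that reading, not the released gepa library's API; the `llm`, `run_system`, and `score` callables and the per-task score bookkeeping are assumptions.

```python
import random

def gepa_step(candidates, batch, llm, run_system, score):
    """One reflective-mutation step of a GEPA-style optimizer (sketch).

    candidates: list of {"prompt": str, "scores": {task_id: float}}.
    batch: a few training examples, each with an .id attribute.
    """
    parent = random.choice(candidates)

    # 1. Sample trajectories (reasoning, tool calls, tool outputs) with
    #    the parent prompt and record their scores.
    trajs = [(ex, run_system(parent["prompt"], ex)) for ex in batch]
    scored = [(ex, tr, score(tr, ex)) for ex, tr in trajs]

    # 2. Reflect in natural language: diagnose failures and propose a
    #    concrete prompt rewrite in a single LLM call.
    report = "\n".join(f"score={s}\n{tr}" for _, tr, s in scored)
    new_prompt = llm(
        "Below are execution traces of the current instruction and their "
        "scores. Diagnose what went wrong and write an improved "
        f"instruction.\n\nCurrent instruction:\n{parent['prompt']}\n\n"
        f"Traces:\n{report}"
    )

    # 3. Test the proposed update; keep it only if it improves the batch.
    child_scores = {ex.id: score(run_system(new_prompt, ex), ex) for ex in batch}
    if sum(child_scores.values()) > sum(s for _, _, s in scored):
        candidates.append({"prompt": new_prompt, "scores": child_scores})
    return candidates
```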

Core claim

GEPA demonstrates that thoroughly incorporating natural language reflection to diagnose issues, propose updates, and combine complementary lessons from the Pareto frontier of attempts allows high-level rules to be learned from trial and error. This often turns just a few rollouts into better prompts than policy gradients can extract from many scalar rewards.

What carries the argument

The Genetic-Pareto prompt optimizer that reflects in natural language on sampled trajectories to diagnose problems, test updates, and combine insights from the Pareto frontier of its own attempts.
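
The Pareto-frontier bookkeeping is the load-bearing mechanism here, so it is worth making concrete. A minimal sketch under the same assumptions as above (per-task scores tracked for every candidate; this is an editorial reading, not the released implementation):

```python
def pareto_frontier(candidates, task_ids):
    """Candidates that are best on at least one task.

    A prompt that wins on only one task survives here, so the lesson it
    encodes stays available for later merging instead of being discarded
    for having a lower average score.
    """
    best = {t: max(c["scores"][t] for c in candidates) for t in task_ids}
    return [c for c in candidates
            if any(c["scores"][t] >= best[t] for t in task_ids)]
```

Sampling parents from this set, rather than from the single top scorer, is what lets complementary lessons recombine.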

If this is right

  • Prompt-based adaptation of LLMs can replace or reduce the need for RL on many downstream tasks.
  • AI systems containing multiple prompts can be improved through repeated cycles of reflection and update.
  • Substantial quality gains become possible even when only a handful of environment interactions are affordable.
  • The same reflection-driven search can serve as an inference-time strategy for code optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that already process language may extract more useful training information from explicit reflections than from numeric reward signals alone.
  • The same reflection-plus-Pareto mechanism could be applied to optimize other structured components inside AI pipelines beyond prompts.
  • Lower rollout counts could translate directly into reduced compute budgets when adapting models to new tasks.

Load-bearing premise

Natural language reflection on trajectories supplies a richer and more effective learning signal than policy gradients derived from sparse scalar rewards.

What would settle it

A head-to-head run on the same tasks in which GRPO, limited to the same small number of rollouts GEPA uses, matches or exceeds GEPA's final performance.
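
Such a run is mechanical to set up once both optimizers sit behind a shared budget. A sketch, where run_gepa, run_grpo, and evaluate are hypothetical stand-ins for the respective pipelines:

```python
def budget_matched_head_to_head(tasks, rollout_budget,
                                run_gepa, run_grpo, evaluate):
    """Give both optimizers the same rollout budget; compare held-out scores.

    run_gepa / run_grpo: (task, rollout_budget) -> adapted system.
    evaluate: (system, task) -> held-out metric.
    """
    results = {}
    for task in tasks:
        results[task.name] = {
            "gepa": evaluate(run_gepa(task, rollout_budget), task),
            "grpo": evaluate(run_grpo(task, rollout_budget), task),
        }
    return results
```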

read the original abstract

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +12% accuracy on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GEPA (Genetic-Pareto), a prompt optimizer that samples trajectories from an AI system, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines lessons across the Pareto frontier of attempts. It claims that across six tasks GEPA outperforms GRPO by 6% on average (up to 20%) while using up to 35x fewer rollouts, outperforms MIPROv2 by over 10%, and shows promise as an inference-time search method for code optimization. The code is released at https://github.com/gepa-ai/gepa.

Significance. If the empirical comparisons prove robust, the work indicates that natural-language reflection can yield richer learning signals than scalar-reward policy gradients, enabling larger gains from far fewer rollouts than RL methods such as GRPO. This would support shifting emphasis toward interpretable, language-mediated adaptation for LLM prompt engineering. The public code release is a clear strength that facilitates direct reproduction and extension.

major comments (2)
  1. Abstract: the efficiency claim rests on GEPA using 'up to 35x fewer rollouts' than GRPO. The method description, however, requires separate LLM calls for trajectory reflection, problem diagnosis, prompt-update proposal, update testing, and Pareto-frontier combination. Without a reported breakdown of total LLM invocations or token budget, the claimed rollout reduction does not yet establish an overall computational advantage.
  2. Abstract: average gains of 6% over GRPO and >10% over MIPROv2 are stated without accompanying information on statistical significance, run-to-run variance, exact data splits, baseline hyper-parameters, or task definitions. These omissions make it impossible to determine whether the reported improvements are load-bearing or could be explained by implementation differences or chance.
minor comments (1)
  1. The abstract would be clearer if the six tasks were named explicitly rather than referred to generically.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the manuscript's transparency on computational costs and statistical robustness.

read point-by-point responses
  1. Referee: Abstract: the efficiency claim rests on GEPA using 'up to 35x fewer rollouts' than GRPO. The method description, however, requires separate LLM calls for trajectory reflection, problem diagnosis, prompt-update proposal, update testing, and Pareto-frontier combination. Without a reported breakdown of total LLM invocations or token budget, the claimed rollout reduction does not yet establish an overall computational advantage.

    Authors: We appreciate this observation. Our efficiency claim is deliberately scoped to rollouts because, in RL methods such as GRPO, the dominant computational expense arises from sampling large numbers of trajectories to compute policy gradients from scalar rewards. GEPA's natural-language reflection is designed to extract richer diagnostic information from each rollout, enabling substantial performance gains with far fewer such samples. Nevertheless, we agree that readers would benefit from a fuller accounting of total LLM invocations and token usage. In the revised manuscript we will add a dedicated table (or subsection in the experiments) that reports the total number of LLM calls and approximate token budgets for GEPA versus GRPO and MIPROv2 on each task, thereby allowing direct comparison of overall computational cost. revision: yes
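
The instrumentation that promised table needs is a thin wrapper around the LLM client; routing every reflection, proposal, and evaluation call through it makes the invocation counts directly comparable across methods. A sketch (the whitespace split is a crude token proxy, not a real tokenizer):

```python
class CountingLLM:
    """Wraps any llm(prompt) -> text callable and tallies total usage."""

    def __init__(self, llm):
        self.llm = llm
        self.calls = 0
        self.tokens_in = 0
        self.tokens_out = 0

    def __call__(self, prompt):
        self.calls += 1
        self.tokens_in += len(prompt.split())    # crude proxy for tokens
        response = self.llm(prompt)
        self.tokens_out += len(response.split())
        return response
```

Reporting calls, tokens_in, and tokens_out per method, alongside rollouts, would turn the scoped rollout claim into the overall cost comparison the referee asks for.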

  2. Referee: Abstract: average gains of 6% over GRPO and >10% over MIPROv2 are stated without accompanying information on statistical significance, run-to-run variance, exact data splits, baseline hyper-parameters, or task definitions. These omissions make it impossible to determine whether the reported improvements are load-bearing or could be explained by implementation differences or chance.

    Authors: The task definitions, data splits, and baseline hyper-parameter choices are described in detail in Section 4 (Experimental Setup) of the manuscript, and the per-task results underlying the 6% and >10% averages are reported in Tables 1–3. We acknowledge, however, that the abstract itself does not reference run-to-run variance or statistical significance tests. In the revision we will (i) add standard deviations or error bars from multiple independent runs to the main results tables, (ii) include p-values from paired statistical tests (e.g., Wilcoxon signed-rank or t-tests) to support the reported improvements, and (iii) ensure the abstract either summarizes these details or explicitly directs readers to the relevant sections and tables. revision: yes
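
For point (ii), the paired tests are one SciPy call each once per-task scores from matched runs are collected. A sketch:

```python
import numpy as np
from scipy import stats

def paired_tests(method_scores, baseline_scores):
    """Paired significance tests over per-task scores from matched runs."""
    a = np.asarray(method_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    t_res = stats.ttest_rel(a, b)       # paired t-test
    w_res = stats.wilcoxon(a, b)        # Wilcoxon signed-rank test
    return {"mean_gain": float((a - b).mean()),
            "t_test_p": float(t_res.pvalue),
            "wilcoxon_p": float(w_res.pvalue)}
```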

Circularity Check

0 steps flagged

No significant circularity in empirical claims or method definition

full rationale

The paper introduces GEPA as an algorithmic prompt optimizer that uses trajectory sampling, natural-language reflection, diagnosis, update proposals, testing, and Pareto-frontier combination. Its central claims are empirical performance comparisons (outperforming GRPO by 6% average / up to 20% and MIPROv2 by >10% across six tasks, with fewer rollouts). No equations, derivations, fitted parameters renamed as predictions, or self-citations that reduce results to tautological inputs appear in the provided text. The method is defined procedurally and evaluated against external baselines, rendering the claims self-contained rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that language reflection yields richer learning than scalar-reward policy gradients; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The interpretable nature of language provides a much richer learning medium for LLMs compared to policy gradients derived from sparse scalar rewards.
    This premise is stated directly as the motivation for introducing GEPA.

pith-pipeline@v0.9.0 · 5630 in / 1126 out tokens · 36072 ms · 2026-05-12T07:21:20.287427+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

    cs.LG 2026-05 conditional novelty 8.0

    Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

  3. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  4. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  5. SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

    cs.SE 2026-05 unverdicted novelty 8.0

    SmellBench evaluates 11 LLM agent setups on 65 architectural smells, finding 47.7% best resolution rate, 63.1% false positives per experts, strong false-positive detection (κ=0.94), but aggressive repairs adding up to...

  6. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  7. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  8. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

  9. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  10. CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG

    cs.LG 2026-05 unverdicted novelty 7.0

    CDS4RAG cyclically optimizes full RAG hyperparameters by distinguishing and alternating between retriever and generator components, boosting performance up to 1.54x over prior methods on benchmarks.

  11. SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

    cs.SE 2026-05 unverdicted novelty 7.0

    SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

  12. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortio...

  13. LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM-HYPER treats an LLM as a hypernetwork that outputs feature-wise weights for a linear CTR model from few-shot multimodal ad examples, achieving 55.9% better NDCG@10 than cold-start baselines and successful producti...

  14. M*: Every Task Deserves Its Own Memory Harness

    cs.PL 2026-04 unverdicted novelty 7.0

    M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.

  15. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  16. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  17. Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations

    cs.AI 2026-05 unverdicted novelty 6.0

    PSAO decomposes prompts into annotated segments to improve LLM reasoning accuracy and self-consistency as a proof-of-concept framework.

  18. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  19. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  20. SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage poin...

  21. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  22. PrismaDV: Automated Task-Aware Data Unit Test Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...

  23. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  24. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  25. Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Meta-prompt optimization enables LLM agents to discover stable, generalizable tacit collusion strategies in market simulations that outperform hand-crafted prompt baselines.

  26. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  27. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...

  28. Harnessing Pre-Resolution Signals for Future Prediction Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Milkyway evolves a future prediction harness using internal feedback from repeated predictions on the same unresolved question, achieving top scores on FutureX (44.07 to 60.90) and FutureWorld (62.22 to 77.96).

  29. Agent-Aided Design for Dynamic CAD Models

    cs.AI 2026-04 unverdicted novelty 6.0

    AADvark extends agent-aided CAD design to dynamic 3D assemblies with movable parts by integrating constraint solvers and visual feedback to create a verification signal for the agent.

  30. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  31. ExecTune: Effective Steering of Black-Box LLMs with Guide Models

    cs.LG 2026-04 unverdicted novelty 6.0

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...

  32. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  33. Reflective Context Learning: Studying the Optimization Primitives of Context Space

    cs.LG 2026-04 unverdicted novelty 6.0

    Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...

  34. Self-Optimizing Multi-Agent Systems for Deep Research

    cs.IR 2026-04 unverdicted novelty 6.0

    Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

  35. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  36. EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.

  37. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  38. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  39. A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...

  40. KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

    cs.SE 2026-04 unverdicted novelty 5.0

    KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...

  41. Harnessing Pre-Resolution Signals for Future Prediction Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Milkyway uses pre-resolution signals from temporal contrasts in evolving evidence and repeated forecasts to evolve a harness and improve predictions before resolution, outperforming baselines on FutureX and FutureWorld.

  42. Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

    cs.CL 2026-04 unverdicted novelty 5.0

    AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.

  43. Supplement Generation Training for Enhancing Agentic Task Performance

    cs.LG 2026-04 unverdicted novelty 4.0

    SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 39 Pith papers
