GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Pith reviewed 2026-05-12 07:21 UTC · model grok-4.3
The pith
Natural language reflection on a few trajectories lets prompt evolution outperform RL with up to 35 times fewer rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GEPA demonstrates that thoroughly incorporating natural language reflection to diagnose issues, propose updates, and combine complementary lessons from the Pareto frontier of attempts allows high-level rules to be learned from trial and error, often turning just a few rollouts into superior prompt quality compared with policy gradients from many scalar rewards.
What carries the argument
The Genetic-Pareto prompt optimizer that reflects in natural language on sampled trajectories to diagnose problems, test updates, and combine insights from the Pareto frontier of its own attempts.
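One way to picture the "Pareto frontier of attempts" is as the set of candidate prompts that no other candidate beats across every task instance. The sketch below is a minimal illustration under that assumption; the function name `pareto_front`, the candidate names, and the scores are hypothetical and are not taken from the paper or its code release.

```python
from typing import Dict, List


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates.

    Candidate A dominates candidate B when A scores at least as well on
    every task instance and strictly better on at least one.
    """
    def dominates(a: List[float], b: List[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    ]


# Hypothetical per-instance scores for three candidate prompts (placeholders).
scores = {
    "prompt_a": [1.0, 0.0, 0.5],
    "prompt_b": [0.0, 1.0, 0.5],
    "prompt_c": [0.0, 0.0, 0.5],
}
front = pareto_front(scores)
print(front)
```

Here `prompt_a` and `prompt_b` each win on a different instance, so both stay on the frontier, while `prompt_c` is dominated. Keeping both winners is what would let a later step combine their complementary lessons.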
If this is right
- Prompt-based adaptation of LLMs can replace or reduce the need for RL on many downstream tasks.
- AI systems containing multiple prompts can be improved through repeated cycles of reflection and update.
- Substantial quality gains become possible even when only a handful of environment interactions are affordable.
- The same reflection-driven search can serve as an inference-time strategy for code optimization.
Where Pith is reading between the lines
- Models that already process language may extract more useful training information from explicit reflections than from numeric reward signals alone.
- The same reflection-plus-Pareto mechanism could be applied to optimize other structured components inside AI pipelines beyond prompts.
- Lower rollout counts could translate directly into reduced compute budgets when adapting models to new tasks.
Load-bearing premise
Natural language reflection on trajectories supplies a richer and more effective learning signal than policy gradients derived from sparse scalar rewards.
What would settle it
A head-to-head run on the same tasks in which GRPO, limited to the same small number of rollouts GEPA uses, matches or exceeds GEPA's final performance.
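A budget-matched comparison of this kind could be scored roughly as follows. This is only a sketch: the learning-curve format, the function names, and every number are illustrative placeholders, not results from the paper.

```python
from typing import List, Optional, Tuple

# A learning curve: (cumulative rollouts, validation score), sorted by rollouts.
Curve = List[Tuple[int, float]]


def score_at_budget(curve: Curve, budget: int) -> Optional[float]:
    """Best score reached with at most `budget` rollouts, or None if none fits."""
    eligible = [score for rollouts, score in curve if rollouts <= budget]
    return max(eligible) if eligible else None


def budget_matched(curve_a: Curve, curve_b: Curve) -> Tuple[int, Optional[float], Optional[float]]:
    """Compare two methods at the smaller of their two total rollout budgets."""
    budget = min(curve_a[-1][0], curve_b[-1][0])
    return budget, score_at_budget(curve_a, budget), score_at_budget(curve_b, budget)


# Placeholder curves for two unnamed methods (made-up numbers).
method_a = [(100, 0.40), (500, 0.55), (1000, 0.60)]
method_b = [(1000, 0.35), (10000, 0.50), (35000, 0.58)]
result = budget_matched(method_a, method_b)
print(result)
```

The comparison truncates the longer run at the shorter run's budget, which is the condition the settling experiment asks for.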
Original abstract
Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +12% accuracy on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa .
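The loop the abstract describes (sample trajectories, reflect, propose an update, test it, keep improvements) can be sketched as below. This is a toy under stated assumptions, not the released implementation: `run` and `reflect` stand in for LLM calls, the parent-selection rule is one simple reading of "Pareto frontier", and all names and the rollout-counting scheme are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Rollout = Tuple[float, str]  # (score, natural-language trace or feedback)


def reflective_search(
    seed_prompt: str,
    examples: List[str],
    run: Callable[[str, str], Rollout],
    reflect: Callable[[str, List[str]], str],
    budget: int,
    minibatch_size: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, float, int]:
    """Return (best prompt, its mean score, rollouts used)."""
    rng = rng or random.Random(0)
    pool = [seed_prompt]
    pool_scores = [[run(seed_prompt, ex)[0] for ex in examples]]
    used = len(examples)
    step_cost = 2 * minibatch_size + len(examples)  # worst case per iteration
    while used + step_cost <= budget:
        # Assumed selection rule: any candidate that is best on some example.
        best_per_example = [max(col) for col in zip(*pool_scores)]
        frontier = [i for i, s in enumerate(pool_scores)
                    if any(x == b for x, b in zip(s, best_per_example))]
        parent = pool[rng.choice(frontier)]
        batch = rng.sample(examples, minibatch_size)
        before = [run(parent, ex) for ex in batch]
        child = reflect(parent, [trace for _, trace in before])
        after = [run(child, ex) for ex in batch]
        used += 2 * len(batch)
        # Keep the child only if it improves on the minibatch.
        if sum(s for s, _ in after) > sum(s for s, _ in before):
            pool.append(child)
            pool_scores.append([run(child, ex)[0] for ex in examples])
            used += len(examples)
    means = [sum(s) / len(s) for s in pool_scores]
    best = max(range(len(pool)), key=means.__getitem__)
    return pool[best], means[best], used


def toy_run(prompt: str, example: str) -> Rollout:
    """Stand-in for a rollout: succeeds if the prompt contains the needed rule."""
    ok = example in prompt.split()
    return (1.0 if ok else 0.0, "ok" if ok else f"missing:{example}")


def toy_reflect(prompt: str, traces: List[str]) -> str:
    """Stand-in for LLM reflection: append the first rule the traces report missing."""
    missing = [t.split(":", 1)[1] for t in traces if t.startswith("missing:")]
    return f"{prompt} {missing[0]}" if missing else prompt


best_prompt, best_mean, used = reflective_search(
    "answer", ["cite", "units", "format"], toy_run, toy_reflect, budget=60
)
print(best_prompt, best_mean, used)
```

The textual trace, not just the scalar score, is what `reflect` consumes; that is the design point the abstract emphasizes. The toy makes it concrete by having the trace name the missing rule.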
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GEPA (Genetic-Pareto), a prompt optimizer that samples trajectories from an AI system, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines lessons across the Pareto frontier of attempts. It claims that across six tasks GEPA outperforms GRPO by 6% on average (up to 20%) while using up to 35x fewer rollouts, outperforms MIPROv2 by over 10%, and shows promise as an inference-time search method for code optimization. The code is released at https://github.com/gepa-ai/gepa.
Significance. If the empirical comparisons prove robust, the work indicates that natural-language reflection can yield richer learning signals than scalar-reward policy gradients, enabling larger gains from far fewer rollouts than RL methods such as GRPO. This would support shifting emphasis toward interpretable, language-mediated adaptation for LLM prompt engineering. The public code release is a clear strength that facilitates direct reproduction and extension.
major comments (2)
- Abstract: the efficiency claim rests on GEPA using 'up to 35x fewer rollouts' than GRPO. The method description, however, requires separate LLM calls for trajectory reflection, problem diagnosis, prompt-update proposal, update testing, and Pareto-frontier combination. Without a reported breakdown of total LLM invocations or token budget, the claimed rollout reduction does not yet establish an overall computational advantage.
- Abstract: average gains of 6% over GRPO and >10% over MIPROv2 are stated without accompanying information on statistical significance, run-to-run variance, exact data splits, baseline hyper-parameters, or task definitions. These omissions make it impossible to determine whether the reported improvements are load-bearing or could be explained by implementation differences or chance.
minor comments (1)
- The abstract would be clearer if the six tasks were named explicitly rather than referred to generically.
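The accounting requested in the first major comment could be kept with a small ledger that separates rollouts from the auxiliary LLM calls. The sketch below is illustrative only: the class name `CostLedger`, the call categories, and the token counts are assumptions, not figures from the manuscript.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CostLedger:
    """Tally LLM calls and tokens by purpose (for example, rollout vs. reflection)."""
    calls: Counter = field(default_factory=Counter)
    tokens: Counter = field(default_factory=Counter)

    def record(self, kind: str, n_tokens: int) -> None:
        self.calls[kind] += 1
        self.tokens[kind] += n_tokens

    def total_calls(self) -> int:
        return sum(self.calls.values())

    def total_tokens(self) -> int:
        return sum(self.tokens.values())


# Placeholder usage with made-up token counts.
ledger = CostLedger()
ledger.record("rollout", 1200)
ledger.record("rollout", 1200)
ledger.record("reflection", 3000)
print(ledger.total_calls(), ledger.total_tokens(), dict(ledger.calls))
```

Reporting both totals per method would show whether a reduction in rollouts survives once reflection and proposal calls are counted.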
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications where possible and committing to revisions that strengthen the manuscript's transparency on computational costs and statistical robustness.
Point-by-point responses
- Referee: Abstract: the efficiency claim rests on GEPA using 'up to 35x fewer rollouts' than GRPO. The method description, however, requires separate LLM calls for trajectory reflection, problem diagnosis, prompt-update proposal, update testing, and Pareto-frontier combination. Without a reported breakdown of total LLM invocations or token budget, the claimed rollout reduction does not yet establish an overall computational advantage.
  Authors: We appreciate this observation. Our efficiency claim is deliberately scoped to rollouts because, in RL methods such as GRPO, the dominant computational expense arises from sampling large numbers of trajectories to compute policy gradients from scalar rewards. GEPA's natural-language reflection is designed to extract richer diagnostic information from each rollout, enabling substantial performance gains with far fewer such samples. Nevertheless, we agree that readers would benefit from a fuller accounting of total LLM invocations and token usage. In the revised manuscript we will add a dedicated table (or subsection in the experiments) that reports the total number of LLM calls and approximate token budgets for GEPA versus GRPO and MIPROv2 on each task, thereby allowing direct comparison of overall computational cost.
  revision: yes
- Referee: Abstract: average gains of 6% over GRPO and >10% over MIPROv2 are stated without accompanying information on statistical significance, run-to-run variance, exact data splits, baseline hyper-parameters, or task definitions. These omissions make it impossible to determine whether the reported improvements are load-bearing or could be explained by implementation differences or chance.
  Authors: The task definitions, data splits, and baseline hyper-parameter choices are described in detail in Section 4 (Experimental Setup) of the manuscript, and the per-task results underlying the 6% and >10% averages are reported in Tables 1–3. We acknowledge, however, that the abstract itself does not reference run-to-run variance or statistical significance tests. In the revision we will (i) add standard deviations or error bars from multiple independent runs to the main results tables, (ii) include p-values from paired statistical tests (e.g., Wilcoxon signed-rank or t-tests) to support the reported improvements, and (iii) ensure the abstract either summarizes these details or explicitly directs readers to the relevant sections and tables.
  revision: yes
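The variance and significance reporting promised above could look roughly like the following. To stay within the standard library, this sketch swaps in an exact paired sign-flip permutation test in place of the Wilcoxon signed-rank or t-tests the authors name. The per-run scores are made-up placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import List


def paired_sign_flip_pvalue(a: List[float], b: List[float]) -> float:
    """Two-sided exact p-value for the mean paired difference.

    Under the null hypothesis each paired difference is equally likely to have
    either sign, so every sign assignment is enumerated (2**n of them).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / 2 ** n


# Placeholder scores from five paired runs of two methods (made-up numbers).
method_a = [0.62, 0.58, 0.65, 0.60, 0.63]
method_b = [0.55, 0.57, 0.59, 0.54, 0.56]
p_value = paired_sign_flip_pvalue(method_a, method_b)
print(f"A: {mean(method_a):.3f} +/- {stdev(method_a):.3f}")
print(f"B: {mean(method_b):.3f} +/- {stdev(method_b):.3f}")
print(f"p = {p_value:.4f}")
```

With only five paired runs the smallest attainable two-sided p-value is 2/32 = 0.0625, which shows why the number of independent runs matters as much as the choice of test.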
Circularity Check
No significant circularity in empirical claims or method definition
The paper introduces GEPA as an algorithmic prompt optimizer that uses trajectory sampling, natural-language reflection, diagnosis, update proposals, testing, and Pareto-frontier combination. Its central claims are empirical performance comparisons (outperforming GRPO by 6% average / up to 20% and MIPROv2 by >10% across six tasks, with fewer rollouts). No equations, derivations, fitted parameters renamed as predictions, or self-citations that reduce results to tautological inputs appear in the provided text. The method is defined procedurally and evaluated against external baselines, rendering the claims self-contained rather than circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The interpretable nature of language provides a much richer learning medium for LLMs compared to policy gradients derived from sparse scalar rewards.
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel · echoes: "We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards."
Forward citations
Cited by 40 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
SmellBench evaluates 11 LLM agent setups on 65 architectural smells, finding 47.7% best resolution rate, 63.1% false positives per experts, strong false-positive detection (κ=0.94), but aggressive repairs adding up to...
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
CDS4RAG cyclically optimizes full RAG hyperparameters by distinguishing and alternating between retriever and generator components, boosting performance up to 1.54x over prior methods on benchmarks.
-
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
-
Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortio...
-
LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks
LLM-HYPER treats an LLM as a hypernetwork that outputs feature-wise weights for a linear CTR model from few-shot multimodal ad examples, achieving 55.9% better NDCG@10 than cold-start baselines and successful producti...
-
M$^\star$: Every Task Deserves Its Own Memory Harness
M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage poin...
-
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
-
PrismaDV: Automated Task-Aware Data Unit Test Generation
PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
Meta-prompt optimization enables LLM agents to discover stable, generalizable tacit collusion strategies in market simulations that outperform hand-crafted prompt baselines.
-
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
-
Harnessing Pre-Resolution Signals for Future Prediction Agents
Milkyway evolves a future prediction harness using internal feedback from repeated predictions on the same unresolved question, achieving top scores on FutureX (44.07 to 60.90) and FutureWorld (62.22 to 77.96).
-
Agent-Aided Design for Dynamic CAD Models
AADvark extends agent-aided CAD design to dynamic 3D assemblies with movable parts by integrating constraint solvers and visual feedback to create a verification signal for the agent.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
AI-Driven Research for Databases
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
-
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...
-
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
-
Harnessing Pre-Resolution Signals for Future Prediction Agents
Milkyway uses pre-resolution signals from temporal contrasts in evolving evidence and repeated forecasts to evolve a harness and improve predictions before resolution, outperforming baselines on FutureX and FutureWorld.
-
Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/2507.14403. Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP.arXiv preprint arXiv:2212.14024, 2022. 16 Accepted at ICLR 2026 (Oral). Omar Khattab, Arnav Singhvi, Paridhi Mahes...
-
[2]
URLhttps://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/ alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/ AlphaEvolve.pdf. White paper. OpenAI. GPT-4.1 series, 2025. Large language model series, released April 2025.https://openai.com/ index/gpt-4-1/. Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Chri...
work page 2025
-
[3]
Optimizing instructions and demonstrations for multi-stage language model programs
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.525. URLhttps: //aclanthology.org/2024.emnlp-main.525/. Anne Ouyang, Simon Guo, Simran Arora, Alex L Zhang, William Hu, Christopher Re, and Azalia Mirho- seini. Kernelbench: Can LLMs write efficient GPU kernels? InScaling Self-Improving Foundation Models without Human Supervision,...
-
[4]
with rank dimension 16,α= 64, and dropout 0.05, using bf16 precision targeting the projection modules[q,k,v,o,up,down,gate]. We use a learning rate of1×10 −5,β= 0.01, reward scale normaliza- tion, and gradient norm clipping of 0.1. Gradients are accumulated for 20 steps before each update, with a “constant with warmup learning” rate scheduler. Non-reentra...
work page 2026
-
[5]
Privacy Preservation: - Do not include any user-specific or sensitive data in the external LLM request. - When queries contain location, dates, names, URLs, or other identifiable details, generalize or omit them in the LLM request. - Replace or abstract any private or potentially identifiable content with neutral placeholders or general terms without losi...
-
[6]
Query Understanding and Reformulation: - Analyze the user's query carefully to understand the underlying task or information need. - Determine if the query involves translation, event recommendations, advice, summarization, or other tasks. 34 Accepted at ICLR 2026 (Oral). - Identify when user-provided content is sensitive or proprietary (e.g., unique text...
work page 2026
-
[7]
- Retain the core informational or functional need so that the LLM can respond effectively
Quality Maximization: - Produce an LLM request that is clear, precise, and directs the LLM to perform the necessary task without requiring private context. - Retain the core informational or functional need so that the LLM can respond effectively. - When referencing external documents or URLs, do not include the actual link or private info; instead, reque...
-
[8]
- Request generalized or example-based information instead of specific user data
Common Strategies for Privacy-preserving Requests: - Use paraphrasing or abstraction for sensitive content. - Request generalized or example-based information instead of specific user data. - When translation is required, include only the text that needs translating if non-private, or otherwise generalize accordingly. - For location- or event-specific que...
-
[9]
Input Format: - A user query string possibly containing private or sensitive information
Reasoning Explanation: - For each transformation, produce clear reasoning explaining how privacy is preserved and why the LLM request fulfills the userâĂŹs intent without leaking sensitive information. Input Format: - A user query string possibly containing private or sensitive information. - Required output: a) A reasoning paragraph explaining your analy...
-
[10]
Privacy Preservation: - Do not include any user-specific details, personal names, locations, dates, URLs, or other potentially identifiable information in the outgoing LLM request. - When the user query contains private, sensitive, or proprietary data, you must generalize, abstract, or omit these details. 35 Accepted at ICLR 2026 (Oral). - Replace sensiti...
work page 2026
-
[11]
Understanding and Reformulation: - Carefully analyze the user's query to identify the underlying task type (e.g., summarization, creative writing, translation, profile writing, event recommendations, company background, academic content generation). - When reformulating, maintain the essential informational or functional need so that the external LLM can ...
-
[12]
Maximizing Quality of the Reformulated Prompt: - Construct clear, precise, and well-structured requests that explicitly guide the external LLM on the task. - Retain appropriate detail and context to ensure relevance, but balance this carefully against privacy concerns. - If referencing external documents, URLs, or institutions, do not include any links or...
-
[13]
- Use general descriptions or hypothetical/example-based requests where appropriate
Common Strategies for Privacy Preservation: - Paraphrase or abstract personal and sensitive content. - Use general descriptions or hypothetical/example-based requests where appropriate. - Omit or generalize specific names, dates, locations, institutions, or proprietary course codes. - When user content is extensive, focus on absorbing the core themes and ...
-
[14]
an interdisciplinary health minor
Explanation Requirement: - Provide a concise reasoning paragraph explaining how you identified sensitive or private details and the steps you took to protect user privacy. - Clarify how your reformulated LLM request retains the userâĂŹs original intent and task needs without risking data leakage. - This explanation is mandatory to document your privacy-pr...
work page 2026
-
[15]
37 Accepted at ICLR 2026 (Oral)
Privacy Preservation Principles: - Remove or replace all user-specific names (personal or organizational), exact dates or durations, locations, URLs, proprietary course or product codes, customer or client names, and other identifiers. 37 Accepted at ICLR 2026 (Oral). - When geographic or organizational mentions are critical for context, abstract these to...
work page 2026
-
[16]
Understanding and Reformulating the Task: - Identify the underlying task (creative writing, summarization, professional communication drafting, etc.) from the userâĂŹs query. - Preserve the functional intent and thematic requirements (e.g., content topics around sustainability, summary of a personâĂŹs background, professional email follow-up). - For user-...
-
[17]
- Avoid ambiguous or overly generic requests that might reduce relevance or usefulness
Maximizing Output Quality While Preserving Privacy: - Construct a prompt that is clear, precise, and contains sufficient context to enable comprehensive and relevant LLM output. - Avoid ambiguous or overly generic requests that might reduce relevance or usefulness. - Maintain a balance between detail necessary for quality and generalization required for privacy
-
[18]
Common Reformulation Strategies: - Replace specific names with generic role identifiers or placeholders (âĂa business contact,âĂİ âĂa notable individualâĂİ). - Replace specific locations or institutions with generalized descriptors, e.g., âĂa country known for eco-tourism.âĂİ - For time references, use relative or approximate terms without revealing expli...
-
[19]
- Explain how the essential task was preserved despite abstraction
Explanation Requirements: - Always provide reasoning that details how privacy risks were identified and mitigated. - Explain how the essential task was preserved despite abstraction. - This reasoning documents the privacy-preserving approach and justifies design choices. Examples and Common Pitfalls: - Do not retain or lightly obscure personal names or co...
work page 2026
-
[20]
Never lightly obscure or partially redact; full abstraction is required
Identification and Treatment of Sensitive Data: - All user-specific or personal names (individual or organizational) must be removed or replaced with generic role descriptors (e.g., âĂa business contact,âĂİ âĂa client,âĂİ âĂa notable individualâĂİ). Never lightly obscure or partially redact; full abstraction is required. 39 Accepted at ICLR 2026 (Oral). -...
work page 2026
-
[23]
Explanation Requirements: - The reasoning must transparently explain how privacy risks were identified (e.g., presence of names, locations, dates, proprietary terms). - It must describe the abstraction or omission methods applied (e.g., replacing âĂJonah Van BeijnenâĂİ with âĂa notable individual,âĂİ substituting âĂMakauâĂİ with âĂa specific region,âĂİ or...
work page 2026
-
[24]
**Input Understanding:** -`question`is the original multi-hop question posed by the user. -`summary_1`is a concise summary of information from a document retrieved in the first hop, which partially addresses the question
-
[25]
**Purpose and Context:** - Your generated`query`aims to find the *missing pieces* of information needed to fully answer the`question`. - The multi-hop retrieval system works in stages: - First hop: The original question returns some documents. - Second hop: Your query must help retrieve any *other relevant documents* NOT found in the first hop that hold c...
-
[26]
Madeira archipelago population in 2011
**Key Observations from Examples and Feedback:** - First-hop documents often cover one entity or aspect in the question. - Remaining relevant documents often involve connected or higher-level concepts mentioned in`summary_1`but not explicitly asked in the original question. - The`query`should be formulated to explicitly target these *missing*, but logical...
work page 2026
-
[27]
**How to Build the Query:** - Identify the entities or topics mentioned in`summary_1`that appear related but different from first-hop documents. - Reframe the query to explicitly mention these broader or related entities connected to the original question. - Include relevant key context from the question to maintain specificity, but shift focus to the mis...
-
[28]
**Practical Strategy:** - Read the`summary_1`carefully to spot references to bigger contexts or other entities not covered in the first hop. - Ask yourself, "What entity or aspect does this summary hint at that could answer the original question but was not found yet?" - Formulate a precise, focused factual query targeting that entity or concept to retrie...
-
[29]
**Output:** - Produce only the field`query`as a clear, concise question or keyword phrase designed for efficient retrieval of **second-hop documents**. - Ensure the query relates logically to the original question while targeting the broader or complementary knowledge identified in`summary_1`. - Do **not** include the original question or simply rephrase ...
work page 2026
-
[30]
**Understand the question precisely.** Determine exactly what is askedâĂŤwhether a name, a specific fact, a date, or a yes/no answer
-
[31]
**Compare both summaries.** Analyze the content of`summary_1`and`summary_2`: - If they agree and directly answer the question, use this as primary evidence. - If one summary provides a fact that the other does not mention, carefully evaluate its plausibility. - If the summaries conflict, use domain expertise and authoritative knowledge to resolve or expli...
-
[32]
**Domain-specific factual verification and nuance:** 44 Accepted at ICLR 2026 (Oral). - **Names and nicknames:** Provide only the specific nickname or name when asked, without extra phrasing. For example, when asked for the nickname of a person or entity, respond with the nickname alone, not a full sentence. - **Nationality and identity distinctions:** Us...
-
[33]
**Answer conciseness and relevance:** - Provide a brief and direct answer to the question. - Avoid repeating or restating the question. - Avoid unnecessary context unless requested or needed for clarity. - Avoid constructing full sentences unless needed; for example, answers to nickname or yes/no questions should be as short and specific as possible
-
[34]
**When authoritative knowledge supplements the summaries:** - If the summaries are incomplete or potentially inaccurate, incorporate trusted knowledge from your training about the topic to provide the correct and precise answer. - For example, when a summary gives a year that conflicts with known release dates or factual details, prefer the verified date....
-
[35]
Chiesa di Filippini Madonna di Galliera e Filippo Neri
**Examples of correct reasoning and answers:** - Question: "What is the nickname of the 2005 Toyota Grand Prix of Long Beach Polesitter?" - Correct answer: `the thrill from West Hill` - Question: "What type of company is Zipcar led by Scott Griffith from 2003-2013?" - Correct answer: `car-sharing company` - Question: "Who was the partner of British ...
-
[36]
Directly relates to the initial question
-
[37]
Captures the core facts and entities needed to understand the scope and context of the question
-
[38]
Includes relevant connections, bridging entities, dates, locations, or descriptions that enable the system to devise focused and effective follow-up queries in subsequent hops
-
[39]
Provides a strong factual foundation for downstream answer generation modules. **Task specifics and best practices:** - The `summary` must represent a distilled synthesis, not just a compression or extractive snippet. - Explicitly include cited passage titles or key entity labels (e.g., "Children in Need 2006 | ..." or "Anthony Levandowski | ...") in your s...
-
[40]
**Objective:** Your query must target documents not retrieved in the first hop, using clues from the summary and the original question
-
[41]
**Key Strategy:** - Identify gaps in the first hop's retrieved documents (e.g., missing entities, relationships, or specific details). - Use explicit information from the summary (e.g., names, locations, quantities) to rephrase the question into a query that surfaces new relevant documents. - Avoid restating the answer directly; instead, structure the que...
-
[42]
**Domain-Specific Guidance:** - If the summary explicitly answers the question, the query should still focus on retrieving documents that provide deeper context or verify the answer (e.g., "What is the headquarters location of [Company]?" instead of "The answer is [Location]"). - Leverage entities mentioned in the summary (e.g., "Carhartt," "Aubrey O'Day"...
-
[43]
**Avoid:** - Generating queries that duplicate the original question. - Assuming the summary contains all necessary information for the second hop.
HotpotQA Qwen3 8B final_answer.predict
Base Prompt: Given the fields `question`, `summary_1`, `summary_2`, produce the fields `answer`.
MIPROv2 Prompt: Given the question, summary_1, and summary_2, generate a step...
- [44]
-
[45]
**Resolving ambiguity**: If the question references a title, historical role, or specific designation (e.g., "second Duke of Florence"), prioritize contextual or historical clues from the summaries to infer the correct answer, even if the exact term is not explicitly stated. Use domain-specific knowledge (e.g., Medici family lineage) to fill gaps when sum...
-
[46]
**Cross-referencing summaries**: Ensure consistency between summaries. If summaries conflict, prioritize the one with explicit factual claims (e.g., numerical data, direct statements). If no explicit claim exists, synthesize information while ensuring alignment with historical, political, or cultural context
-
[47]
**Avoiding overgeneralization and extra information**: Focus strictly on the most specific and directly stated information in the summaries. Do not add context, explanations, or external knowledge beyond what is explicitly provided. For example, if the question asks for a year, provide only the year; do not include band member details or historical background
-
[48]
**Prioritizing factual alignment**: If a summary explicitly states the answer, use that. If summaries are indirect or vague, synthesize information while ensuring alignment with factual knowledge (e.g., linking "Path to Prosperity" to Rep. Paul Ryan's Medicare proposal). **Key adjustments based on feedback**: - **Conciseness**: Answers must be strictly ...
-
[49]
is a German author, philosopher, academic and film director.', 'Cattle King | Cattle King is a 1963 film directed by Tay Garnett. It stars Robert Taylor and Robert Loggia. It also appears to have been called Guns of Wyoming in some countries.', 'Günther von Kluge | Günther von Kluge (30 October 1882 – 19 August 1944) was a German field marshal during Wor...
-
[50]
**Extracts direct answers** from the top retrieved passages to address the question
-
[51]
**Identifies and highlights missing or implied clues** that may require further retrieval (e.g., entities, connections, or contextual details)
-
[52]
**Synthesizes information** by combining explicit facts from the passages with domain-specific knowledge or logical inferences to guide subsequent steps. ### **Summary Structure** - **Entity/Person Mention**: Clearly state the subject (e.g., "Billy Truax", "Eintracht Braunschweig") and include **full names, titles, or official designations** (e.g., "Thoma...
-
[53]
**Explicit Answers First**: Prioritize explicitly stated facts from the context and passages (e.g., direct mentions of entities, roles, or relationships)
-
[54]
**Infer or Generalize When Necessary**: If critical details are missing from the passages, infer connections or generalize based on contextual clues and domain-specific knowledge (e.g., linking ownership structures, roles, or historical context)
-
[55]
**Bridge Gaps**: Ensure the summary includes all **key supporting information** required to answer the question, even if it is not explicitly stated in the input. For example: - If the answer is "Newcastle United," include details about Sports Direct's ownership and the connection to the billionaire. - If the answer is a person's role (e.g., "troubleshoot...
-
[56]
**Structure and Precision**: - Clearly connect entities, roles, and relationships (e.g., "Stan Kroenke owns Sports Direct and Arsenal F.C."). - Avoid ambiguity by including all necessary contextual links (e.g., "Mike Ashley founded Sports Direct and owns Newcastle United"). - Use precise terminology and ensure alignment with domain-specific knowledge (e.g...
-
[57]
**Domain-Specific Knowledge**: Leverage implicit domain knowledge when passages lack critical details (e.g., knowing that "Project RAND" is linked to Henry H. Arnold and the RAND Corporation). ### Example Integration: If the question is about a person's profession in a novel, ensure the summary includes: - The character's name. - Their profession (explici...
-
[58]
Carefully interpret the mathematical or logical problem
-
[59]
Show your reasoning internally to confirm the final answer (reasoning does not need to be included in the response unless explicitly requested)
-
[60]
Provide the final direct response strictly following all instructions, especially when asked to repeat the query verbatim first before giving the answer. General approach: - Always parse the query thoroughly to extract every constraint and instruction. - Ensure your response exactly matches the format, wording, and content as instructed. - Do not invent o...
-
[61]
**Query Parsing and Extraction of Instructions** - Carefully read the entire query to identify all explicit instructions concerning: - Whether and how to repeat the query text (verbatim or partially). - Specific length constraints (number of sentences, bullet points, word counts). - Formatting instructions (e.g., capitalization requirements, quotation mar...
-
[62]
**Exact Text Reproduction** - When asked to repeat the query text (or any other required phrase) verbatim, do so with zero changes: no added or removed words, punctuation, or formatting. - Do not prepend or append anything to the repeated text unless explicitly instructed. - Preserve all original capitalization, spacing, and punctuation exactly as in the query
-
[63]
**Structural and Formatting Compliance** - Follow all formatting instructions strictly, such as: - Wrapping the entire response in quotation marks if required. - Using specified markdown bullet point styles (e.g., asterisks). - Ensuring capitalization instructions (e.g., all caps or minimum occurrences of uppercase words) are perfectly met. - Adhering to ...
-
[64]
**Response Content Accuracy and Appropriateness** - After fulfilling all structural requirements, respond to the main substantive question accurately and completely. - Use domain knowledge and reliable calculations to ensure factual correctness in answers. - For questions requesting sensitive or potentially harmful content (e.g., cures without scientific ...
-
[65]
**No Extraneous Text** - Do not add explanations, internal reasoning, apologies, or meta commentary beyond what the query explicitly permits or demands. - Your final output must be the exact, ready-to-deliver response that meets all user instructions perfectly
-
[66]
**Examples and Patterns Observed** - Users often combine multiple complex formatting and content instructions (e.g., repetition of request text, followed by specific number of sentences or bullet points, with capitalization rules). - Ensure you carefully distinguish when to repeat the query text verbatim and when to respond directly (sometimes the repetit...
-
[67]
**First**, repeat the user's query **word for word** without any changes or additions
-
[68]
**Then**, provide your answer in the specified format, adhering to all constraints (e.g., markdown, structure, content)
-
[69]
**Do not include any additional text, explanations, or formatting** beyond the repeated query and your answer
-
[70]
**Include niche/domain-specific factual details** (e.g., technical commands, best practices, or platform-specific configurations) if applicable, as these are critical for accurate task completion
-
[71]
**Use precise formatting** (e.g., bullet points, code blocks, headers) as requested, ensuring no markdown is omitted or altered
-
[72]
**Avoid generalizable strategies** unless explicitly instructed; focus on actionable, specific guidance
-
[73]
**Validate all technical steps** (e.g., Dockerfile syntax, CLI commands) for accuracy and completeness
-
[74]
**Highlight potential pitfalls and solutions** to address common issues in the task
-
[75]
**Prioritize clarity and conciseness**, ensuring the response is both comprehensive and easy to follow
-
[76]
**Adhere to language and case requirements** (e.g., all caps, English only) if specified.
L.5 HOVER, GPT-4.1 MINI
HoVer GPT-4.1 Mini create_query_hop2.predict
Base Prompt: Given the fields `claim`, `summary_1`, produce the fields `query`.
MIPROv2 Prompt: Given the original claim and the initial summary of retrieved evidence, ...
-
[77]
**Extract key factual elements from the claim**: names, dates, titles, roles, events, or relationships explicitly or implicitly stated
-
[78]
**Contrast these facts with the summary to identify points of agreement, contradiction, or ambiguity.**
-
[79]
**Formulate fact-checking queries that are:** - Tightly focused on the core factual issues raised by the claim and addressed or contradicted by the summary. - Include named entities, dates, roles, or other domain-specific identifiers directly mentioned in both claim and summary to improve retrieval effectiveness. - When relevant, break complex claims into...
-
[80]
**When relevant details appear only in the summary but are hinted at or missing from the claim (e.g., specific titles, roles, or names), include these in the queries to enable retrieval of key evidence.**
-
[81]
**Use a clear, natural question format or targeted keyword phrases that could serve well as search queries.**
-
[82]
**Avoid overly broad or generic queries; precision improves evidence retrieval quality.**
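The query-building steps listed above (extract key terms from the claim, compare them with the summary, and carry over details that appear only in the summary) can be illustrated with a small sketch. This is not the paper's implementation: the function names `key_terms` and `hop2_keyword_query` are hypothetical, and the regular expression is a crude stand-in for real entity extraction, assumed here only to keep the example self-contained. The claim and summary strings are illustrative inputs built from a passage quoted earlier on this page.

```python
import re

# Hypothetical heuristic: runs of capitalized words and 4-digit years stand in
# for named entities and dates. A real system would use a proper entity tagger.
TERM = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*|\b\d{4}\b")


def key_terms(text):
    """Return candidate entities/dates in order of first appearance, without duplicates."""
    seen = []
    for term in TERM.findall(text):
        if term not in seen:
            seen.append(term)
    return seen


def hop2_keyword_query(claim, summary):
    """Build a keyword query from claim terms plus terms found only in the summary."""
    claim_terms = key_terms(claim)
    summary_only = [t for t in key_terms(summary) if t not in claim_terms]
    return " ".join(claim_terms + summary_only)


# Illustrative inputs (assumed for this sketch, not taken from the benchmark).
claim = "Cattle King stars Robert Loggia."
summary = "Cattle King is a 1963 film directed by Tay Garnett."

query = hop2_keyword_query(claim, summary)
print(query)  # Cattle King Robert Loggia 1963 Tay Garnett
```

The sketch produces a targeted keyword phrase rather than a natural-language question; the guidance above allows either form. In the optimized prompts the language model does this selection itself, so a heuristic like this would serve at most as a sanity check on generated queries.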