Recognition: no theorem link
To Write or to Automate Linguistic Prompts, That Is the Question
Pith reviewed 2026-05-15 01:07 UTC · model grok-4.3
The pith
GEPA optimization on DSPy signatures produces results comparable to expert hand-crafted prompts across most linguistic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that, across all tasks, GEPA elevates minimal DSPy signatures to the point where the majority of expert-versus-optimized comparisons show no statistically significant difference, even though the setup grants the optimizer access to gold-standard splits that the expert prompts forgo.
What carries the argument
GEPA optimization of DSPy signatures, measured against zero-shot expert prompts in translation, terminology insertion, and language quality assessment across five model configurations.
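As a concrete illustration of that machinery, the sketch below shows what a minimal DSPy signature and a GEPA compile step might look like. This is not the paper's code: the model name, field names, toy data, and metric are assumptions, and GEPA's constructor arguments vary across DSPy versions.

```python
import dspy

# Hypothetical model choice; the paper's five configurations are not named here.
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

class Translate(dspy.Signature):
    """Translate the source text into the target language."""
    source: str = dspy.InputField()
    target_language: str = dspy.InputField()
    translation: str = dspy.OutputField()

# The "base DSPy signature" condition: just the signature, no hand-written prompt.
base_program = dspy.Predict(Translate)

# Toy labeled examples standing in for the gold-standard splits GEPA searches over.
trainset = [
    dspy.Example(source="Guten Morgen", target_language="English",
                 translation="Good morning").with_inputs("source", "target_language"),
]

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Placeholder scorer; the paper's task metrics would be plugged in here.
    return float(gold.translation.strip().lower() == pred.translation.strip().lower())

# GEPA reflects on failures in the labeled split and rewrites the signature's instructions.
optimizer = dspy.GEPA(metric=metric, auto="light", reflection_lm=lm)
optimized_program = optimizer.compile(base_program, trainset=trainset, valset=trainset)
```

In a run like this, the optimized program would carry a rewritten instruction in place of the bare signature docstring; that rewritten instruction is the artifact compared against the expert prompt in the paper's setup.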
If this is right
- GEPA optimization reliably improves performance over minimal DSPy signatures in every tested linguistic task.
- Terminology insertion quality is mostly indistinguishable between optimized and manual prompts.
- Translation results split by model, with each prompt type winning on different configurations.
- Expert prompts outperform on error detection in language quality assessment, while optimization improves error characterization.
- Most direct expert-versus-optimized comparisons show no statistically significant difference.
Where Pith is reading between the lines
- If experts were also given labeled data for refinement, the remaining gaps might shrink or reverse.
- Production NLP pipelines could shift from manual prompt writing to automated optimization loops for new domains.
- The task dependence implies that hybrid workflows—expert guidance for some tasks, full automation for others—may be optimal.
- The same comparison framework could be extended to few-shot settings or additional languages to test generality.
Load-bearing premise
The assumption that an asymmetric setup—GEPA searching over labeled gold-standard data while expert prompts receive none—still allows a fair test of prompt quality.
What would settle it
A follow-up experiment in which expert prompt writers are also given the same labeled splits for refinement and the statistical-significance results are re-checked.
Original abstract
LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require in principle no labeled data, relying instead on domain expertise and iterative refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment tasks, using five model configurations. Results are task-dependent: optimized and manual prompts are mostly statistically indistinguishable in terminology insertion; different approaches win on different models in translation; expert prompts are stronger for error detection while optimization improves characterization in LQA. Across tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. The authors note an asymmetry where GEPA searches over gold-standard splits while expert prompts use no labeled data.
Significance. If the results hold under a balanced comparison, this work would be significant as the first empirical head-to-head evaluation of automatic prompt optimization versus expert engineering in linguistic tasks. It provides evidence that GEPA can elevate base signatures to levels comparable with experts in several settings, which could reduce reliance on manual prompt crafting in NLP pipelines. The task-dependent patterns offer practical guidance on when automation succeeds. However, the acknowledged asymmetry in data access during optimization substantially weakens the strength of the equivalence claims.
major comments (2)
- [Abstract] The central claim that 'the majority of expert-optimized comparisons show no statistically significant difference' cannot be evaluated because the abstract (and by extension the reported results) provides no details on the exact metrics used (e.g., BLEU, accuracy, or F1), the statistical tests performed, sample sizes, effect sizes, or p-value thresholds. This absence makes the support for the no-difference conclusion difficult to assess, even though that conclusion is load-bearing for the headline result.
- [Abstract] The experimental setup is asymmetric in a way that directly undermines the fairness of the comparison underlying the no-significant-difference claim. GEPA optimization searches over gold-standard splits and thus has supervised access to the evaluation distribution, while expert prompts are constructed with zero labeled data. This violates the ceteris-paribus assumption required to interpret parity as evidence that automation can replace expert engineering; the observed equivalence may partly be an artifact of GEPA's access to test labels rather than intrinsic prompt quality.
minor comments (2)
- [Abstract] The abstract refers to 'five model configurations' and 'statistical comparisons' without naming the models or providing any concrete performance numbers; adding at least one illustrative example (model names plus a sample metric value) would improve immediate clarity.
- Ensure the full results section reports all comparisons with effect sizes and confidence intervals in addition to p-values, so that readers can judge practical as well as statistical significance; a minimal sketch of such a comparison follows this list.
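To make the request concrete, a paired comparison with an effect size and a bootstrap confidence interval could look like the sketch below. The scores are invented for illustration and are not drawn from the paper; only the procedure (paired t-test, Cohen's d on the differences, bootstrap CI on the mean difference) is the point.

```python
import numpy as np
from scipy import stats

# Illustrative per-item scores for the same test items under two prompt conditions;
# these numbers are made up, not the paper's results.
expert_scores = np.array([0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.73, 0.62])
optimized_scores = np.array([0.69, 0.66, 0.78, 0.61, 0.75, 0.70, 0.71, 0.65])
diff = expert_scores - optimized_scores

# Paired t-test: is the mean per-item difference distinguishable from zero?
t_stat, p_value = stats.ttest_rel(expert_scores, optimized_scores)

# Cohen's d for paired samples: mean difference over the SD of the differences.
cohens_d = diff.mean() / diff.std(ddof=1)

# Bootstrap 95% confidence interval on the mean difference.
rng = np.random.default_rng(0)
boot_means = [rng.choice(diff, size=diff.size, replace=True).mean() for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"t={t_stat:.2f}, p={p_value:.3f}, d={cohens_d:.2f}, "
      f"95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```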
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, proposing targeted revisions to improve clarity and balance in the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim that 'the majority of expert-optimized comparisons show no statistically significant difference' cannot be evaluated because the abstract (and by extension the reported results) provides no details on the exact metrics used (e.g., BLEU, accuracy, or F1), the statistical tests performed, sample sizes, effect sizes, or p-value thresholds. This absence makes the support for the no-difference conclusion difficult to assess, even though that conclusion is load-bearing for the headline result.
Authors: We agree that the abstract lacks sufficient methodological detail to support the central claim. In the revised version we will expand the abstract to specify the metrics (BLEU for translation, accuracy for terminology insertion, F1/accuracy for LQA), the statistical procedure (paired t-tests), sample sizes drawn from the evaluation sets, the p-value threshold (0.05), and a brief reference to effect sizes for the no-significant-difference comparisons. These additions will make the headline result directly evaluable from the abstract. revision: yes
-
Referee: [Abstract] The experimental setup is asymmetric in a way that directly undermines the fairness of the comparison underlying the no-significant-difference claim. GEPA optimization searches over gold-standard splits and thus has supervised access to the evaluation distribution, while expert prompts are constructed with zero labeled data. This violates the ceteris-paribus assumption required to interpret parity as evidence that automation can replace expert engineering; the observed equivalence may partly be an artifact of GEPA's access to test labels rather than intrinsic prompt quality.
Authors: We acknowledge that the asymmetry is a substantive limitation that tempers the strength of the equivalence claims. The manuscript already flags this point, but we will substantially expand the discussion section to (1) restate the asymmetry explicitly, (2) clarify that GEPA's use of gold-standard splits mirrors realistic development-set optimization while expert prompts remain zero-shot, and (3) qualify the interpretation of parity as evidence that automation can match expert performance when labeled data are available for search. We will avoid stronger language implying full replacement of expert engineering. A fully symmetric re-experiment is not feasible within the current study scope, but the revised text will present the results with the appropriate caveats. revision: partial
Circularity Check
No significant circularity in empirical prompt comparison
Full rationale
This is a purely empirical study comparing three prompt sources (expert zero-shot, base DSPy, GEPA-optimized) on external task metrics across translation, terminology insertion, and LQA. No equations, derivations, or parameter-fitting steps are present that could reduce any result to its own inputs by construction. The authors explicitly flag the asymmetric label access and report statistical tests against held-out performance; outcomes remain falsifiable on independent data. No self-citation chains, ansatzes, or renamings of known results appear in the load-bearing claims.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard statistical significance testing applies to prompt quality comparisons.
Forward citations
Cited by 1 Pith paper
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
discussion (0)