Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Brejesh Lall; Jyotirmoy Nath; Neeraj Kumar

arxiv: 2605.28360 · v1 · pith:XAZ7KIAKnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:23 UTC · model grok-4.3

keywords: prompt optimization, compositional learning, discrete codebooks, instruction refinement, textual gradients, min-max optimization, language model agents, per-instance routing

The pith

Prompt Codebooks recasts automatic prompt optimization as discrete learning over a finite vocabulary of reusable natural-language instincts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing methods optimize each task prompt as one monolithic string through global edits, which produces brittle updates and blocks reuse of learned sub-behaviors across different inputs. Prompt Codebooks instead stores prompt-construction knowledge in a discrete codebook of atomic instincts and routes every input instance to a small subset of them. An LLM encoder selects the instincts, a generator composes them into a prompt for the frozen target model, and a critic supplies structured verdicts that decompose into per-variable textual gradients. These components train jointly under a language-valued min-max objective. The approach yields per-instance routing that instance-blind methods cannot express, and on six benchmarks with Qwen3-8B and LLaMA-3.1-8B it records gains up to 30.36 points over zero-shot while cutting deployed prompt length by up to 14.1 times.

Core claim

The paper claims that four design choices together produce more effective and shorter prompts than monolithic optimization: organizing prompt knowledge into a discrete codebook of instincts, routing each input to a small subset via an LLM encoder, composing the selected instincts with a generator, and training the whole system with a critic under a language-valued min-max objective. The reported results are gains of up to 30.36 points over zero-shot and length reductions of up to 14.1x versus MIPROv2, using only 16 instincts.

What carries the argument

The Prompt Codebook: a finite vocabulary of natural-language instincts together with an LLM encoder that routes inputs to small subsets, a generator that composes them into prompts, and a critic that emits structured verdicts for joint optimization under a language-valued min-max objective.
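The route-then-compose flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Instinct`, `build_prompt`, `keyword_encoder`, and `concat_generator` are hypothetical, the three example instincts are invented, and the encoder and generator are toy functions standing in for the LLM components the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Instinct:
    """One codebook entry: an atomic natural-language instruction."""
    idx: int
    text: str


# Hypothetical type aliases; in the paper both roles are played by LLMs.
Encoder = Callable[[str, Sequence[Instinct]], List[int]]
Generator = Callable[[str, Sequence[Instinct]], str]


def build_prompt(x: str, codebook: Sequence[Instinct],
                 encoder: Encoder, generator: Generator) -> str:
    """Route input x to a subset of instincts, then compose a prompt."""
    chosen = [codebook[i] for i in encoder(x, codebook)]
    return generator(x, chosen)


# Toy stand-ins (assumptions) so the sketch runs without any model.
def keyword_encoder(x: str, codebook: Sequence[Instinct]) -> List[int]:
    # Select instincts whose first word occurs in the input; else entry 0.
    words = set(x.lower().split())
    hits = [c.idx for c in codebook if c.text.lower().split()[0] in words]
    return hits[:2] or [0]


def concat_generator(x: str, chosen: Sequence[Instinct]) -> str:
    rules = "\n".join(f"- {c.text}" for c in chosen)
    return f"Instructions:\n{rules}\n\nInput: {x}"


# Invented example instincts; the paper's learned codebook is not shown here.
codebook = [
    Instinct(0, "Answer concisely."),
    Instinct(1, "Verify each claim against the evidence."),
    Instinct(2, "Compare both entities before answering."),
]

prompt_a = build_prompt("verify that Paris is in France",
                        codebook, keyword_encoder, concat_generator)
prompt_b = build_prompt("compare Paris and Rome",
                        codebook, keyword_encoder, concat_generator)
```

Two inputs from the same toy task end up with different instinct subsets, which is the per-instance behaviour the codebook design is meant to allow. The frozen target model would receive `prompt_a` or `prompt_b` unchanged.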

If this is right

Per-instance routing lets different inputs inside the same task receive different instinct compositions, a regime impossible for instance-blind methods.
Accuracy improves by up to 30.36 points over zero-shot and by 3.34 points over GEPA on HotpotQA, in experiments run on Qwen3-8B and LLaMA-3.1-8B.
Deployed prompt length drops by up to 14.1 times versus MIPROv2 and 3.0 times versus GEPA while using only K=16 instincts.
The codebook structure separates reusable sub-behaviors from instance-specific selection, enabling reuse across tasks without re-optimizing entire prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same codebook-plus-encoder pattern could be applied to other adaptive-instruction settings such as tool-use chains or multi-agent coordination.
Because instincts remain explicit natural-language units, the learned codebook may support human inspection and manual editing of individual sub-behaviors.
If routing remains stable across model scales, the approach could reduce the need to store or transmit long task-specific prompts at inference time.

Load-bearing premise

The LLM encoder can reliably route each input to an effective small subset of instincts and the critic's structured verdicts supply usable textual gradients that jointly improve the encoder, generator, and codebook.

What would settle it

A controlled experiment on a held-out benchmark or model in which PCO fails to exceed the strongest baseline in aggregate accuracy or fails to produce shorter effective prompts than GEPA would falsify the performance and compression claims.
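The falsification condition above reduces to two comparisons, which can be written down as a small check. This is a sketch under assumptions: the `Result` record, its field names, and the function `claims_survive` are hypothetical, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Result:
    accuracy: float       # aggregate accuracy on the held-out benchmark
    prompt_tokens: float  # mean deployed prompt length, in tokens


def claims_survive(pco: Result, strongest_baseline: Result, gepa: Result) -> bool:
    """Return False if either the performance or the compression claim fails."""
    beats_baseline = pco.accuracy > strongest_baseline.accuracy
    shorter_than_gepa = pco.prompt_tokens < gepa.prompt_tokens
    return beats_baseline and shorter_than_gepa


# Placeholder values for illustration only (not taken from the paper).
gepa = Result(accuracy=0.68, prompt_tokens=900.0)
ok = claims_survive(Result(0.70, 300.0), strongest_baseline=gepa, gepa=gepa)
falsified = not claims_survive(Result(0.66, 300.0), strongest_baseline=gepa, gepa=gepa)
```

A real test would also need a statistical criterion for "exceeds", for example a confidence interval over seeds, which the strict inequality here leaves out.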

Figures

Figures reproduced from arXiv: 2605.28360 by Brejesh Lall, Jyotirmoy Nath, Neeraj Kumar.

**Figure 2.** PCO inference pipeline (HoVer, Qwen3-8B).

**Figure 3.** Codebook usage for full PCO (blue) vs. with …

**Figure 4.** Evolution of encoder routing probabilities across training. The adaptive PCO encoder (Figure 4a) progressively concentrates routing toward high-performing instincts while preserving exploration through ε-greedy sampling. By contrast, the static-routing ablation (Figure 4b) produces nearly invariant selection patterns across epochs, indicating that learnable routing is necessary for task…
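The Figure 4 caption mentions ε-greedy sampling during routing. A generic ε-greedy subset selector looks like the sketch below. The function name `epsilon_greedy_route`, the subset size `m`, and the choice to explore with a uniformly random subset are assumptions; the source does not specify how the paper's encoder implements exploration.

```python
import random
from typing import List, Sequence


def epsilon_greedy_route(probs: Sequence[float], m: int, epsilon: float,
                         rng: random.Random) -> List[int]:
    """Pick m instinct indices: explore with probability epsilon, else exploit."""
    if not 0 < m <= len(probs):
        raise ValueError("m must be between 1 and the codebook size")
    if rng.random() < epsilon:
        # Explore: a uniformly random subset of size m.
        return sorted(rng.sample(range(len(probs)), m))
    # Exploit: the m indices with the highest routing probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:m])


probs = [0.10, 0.50, 0.05, 0.35]  # toy routing probabilities over 4 instincts
greedy = epsilon_greedy_route(probs, m=2, epsilon=0.0, rng=random.Random(0))
# greedy == [1, 3]: with epsilon = 0 the two most probable instincts are chosen
explore = epsilon_greedy_route(probs, m=2, epsilon=1.0, rng=random.Random(0))
```

With `epsilon=0.0` the selection is deterministic, which mirrors the concentration toward high-performing instincts in the caption; a positive `epsilon` keeps low-probability instincts reachable during training.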

Original abstract

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.
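The abstract says the critic's structured verdict "decomposes by attribution into per-variable textual gradients". One way to picture that decomposition is a verdict whose findings are each tagged with the component they blame, then grouped per component. This is only an illustrative sketch: the `Verdict` structure, the `VARIABLES` tuple, and the function `textual_gradients` are hypothetical, and the example critiques are invented; the abstract does not give the paper's actual verdict schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The three jointly trained components named in the abstract.
VARIABLES = ("encoder", "generator", "codebook")


@dataclass
class Verdict:
    """Hypothetical critic output: an outcome plus attributed critiques."""
    correct: bool
    # Each finding pairs the blamed variable with a natural-language critique.
    findings: List[Tuple[str, str]] = field(default_factory=list)


def textual_gradients(verdict: Verdict) -> Dict[str, List[str]]:
    """Group critiques by the variable they are attributed to."""
    grads: Dict[str, List[str]] = {v: [] for v in VARIABLES}
    for variable, critique in verdict.findings:
        if variable not in grads:
            raise KeyError(f"unknown variable: {variable}")
        grads[variable].append(critique)
    return grads


# Invented example verdict for a wrong multi-hop answer.
verdict = Verdict(correct=False, findings=[
    ("encoder", "The multi-hop instinct was not selected for a two-hop question."),
    ("codebook", "The evidence-checking instinct is too vague about citing passages."),
])
grads = textual_gradients(verdict)
```

Each list in `grads` would then be handed to whatever update step rewrites that component; the update rule itself and the min-max objective are not specified in the abstract and are not modelled here.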

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

PCO frames prompt optimization as learning a small codebook of reusable instincts with per-instance LLM routing, which is structurally different from monolithic baselines and yields shorter prompts plus modest gains.

The core idea is to stop treating each prompt as one big string and instead learn a fixed set of 16 atomic natural-language instructions, then let an encoder pick a small subset for each input and compose them on the fly. That per-instance routing is the part that prior global-edit methods like GEPA and MIPROv2 cannot express.

The length reductions stand out: prompts are up to 14.1x shorter than MIPROv2's and 3.0x shorter than GEPA's, while PCO still beats the strongest baseline (GEPA) by 1.11 points in aggregate and 3.34 on HotpotQA. The zero-shot lifts are larger, as expected. Keeping the target model frozen and training only the encoder, generator, and codebook makes for a clean setup.

The soft spot is the training loop. The abstract describes a critic that emits structured verdicts turned into textual gradients under a language-valued min-max objective, but supplies no ablations on whether those gradients actually move the codebook entries, how stable the min-max game is, or what happens when the encoder routes poorly. Without those checks it is hard to know whether the reported numbers come from the compositional mechanism or from extra optimization effort.

The citation pattern is standard and the claims rest on external baselines rather than self-reference, so there is no obvious circularity. The weakest assumption is that the LLM encoder can consistently select useful instinct subsets and that the critic feedback is informative enough to improve all three components jointly.

This is for groups already running prompt-optimization experiments who want a more modular alternative. It is worth sending to review because the framing is distinct and the empirical claims are specific enough to test, even if the method section will need close reading.

Referee Report

0 major / 1 minor

Summary. The paper proposes Prompt Codebooks (PCO), a compositional automatic prompt optimization (APO) framework that represents prompt knowledge as a discrete codebook of K natural-language 'instincts.' An LLM encoder performs per-instance routing to a small subset of instincts; a generator composes them into a prompt for a frozen target LLM; and a critic produces structured, attribution-decomposed verdicts that supply textual gradients for joint min-max training of the encoder, generator, and codebook. Empirical claims include gains of up to +30.36 points over zero-shot, +3.34 on HotpotQA and +1.11 aggregate over GEPA, and prompt-length reductions of up to 14.1x vs. MIPROv2, all on Qwen3-8B and LLaMA-3.1-8B across six benchmarks using only K=16 instincts.

Significance. If the empirical results and training procedure hold under scrutiny, the work would be significant for introducing the first instance-specific, reusable compositional mechanism in APO, a regime that monolithic global-edit methods cannot express. The language-valued min-max objective and per-variable textual gradients constitute a concrete technical contribution that could influence future discrete optimization approaches for LLMs.

minor comments (1)

The abstract states concrete performance numbers, length reductions, and comparisons to GEPA/MIPROv2, yet supplies no experimental protocol, ablation design, statistical tests, or implementation details on the min-max objective; the manuscript must include these to allow evaluation of the data-to-claim link.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our work and for acknowledging its potential significance in introducing the first instance-specific, reusable compositional mechanism in automatic prompt optimization. The recommendation of 'uncertain' appears to stem from the absence of detailed major comments in the report. Below we provide point-by-point responses where applicable; since no specific major comments were enumerated, we focus on clarifying the core technical and empirical elements that may underlie the uncertainty.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims consist of empirical performance gains on external benchmarks (Qwen3-8B, LLaMA-3.1-8B, HotpotQA, etc.) against independent baselines (GEPA, MIPROv2, zero-shot). No derivation, equation, or optimization step is shown that reduces by construction to a fitted parameter, self-citation, or renamed input; the codebook, encoder, and critic are presented as a new construction whose value is measured by held-out task accuracy and prompt length, not by internal identity. The abstract and reader's assessment confirm the results rest on falsifiable external comparisons rather than self-referential fitting.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the learnability of atomic instincts and the capacity of LLMs to perform encoding, generation, and criticism roles; K=16 is an explicit hyperparameter choice.

free parameters (1)

K = 16
Codebook size fixed at 16 instincts for all reported experiments.
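To give a sense of how much a 16-entry codebook can express, the short calculation below counts the distinct instinct subsets available per input. It treats a composition as an unordered subset, and the subset sizes shown are assumptions, since the source says only that each input is routed to "a small subset"; the helper `num_compositions` is hypothetical.

```python
from math import comb

K = 16  # codebook size reported for all experiments


def num_compositions(k: int, max_subset_size: int) -> int:
    """Count non-empty unordered subsets of at most max_subset_size entries."""
    return sum(comb(k, m) for m in range(1, max_subset_size + 1))


# Number of subsets of exactly m instincts, for small m.
for m in range(1, 5):
    print(m, comb(K, m))
# prints:
# 1 16
# 2 120
# 3 560
# 4 1820

total_up_to_4 = num_compositions(K, 4)  # 16 + 120 + 560 + 1820 = 2516
```

Even with small subsets, the number of possible compositions is far larger than the single prompt that an instance-blind method deploys per task.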

axioms (1)

domain assumption: LLMs can function as reliable encoders for routing, generators for composition, and critics providing attributional textual gradients.
The joint training procedure depends on these LLM capabilities.

invented entities (1)

instincts (no independent evidence)
purpose: Atomic reusable natural-language instruction units that serve as the discrete vocabulary for prompt construction.
Core new construct enabling the compositional regime.

pith-pipeline@v0.9.1-grok · 5800 in / 1376 out tokens · 53695 ms · 2026-06-29T12:23:37.241594+00:00 · methodology
