When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
Pith reviewed 2026-05-08 18:47 UTC · model grok-4.3
The pith
Small language models reach 84-87% correct, valid structured outputs on GSM8K (34-40% on MATH) when system prompts are discovered by an automated meta-agent optimizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that an iterative system-prompt optimizer called AloLab, driven by a capable meta-agent with only black-box API access to the target model, can discover prompts that make small language models generate outputs that are simultaneously mathematically correct and valid JSON. On GSM8K and MATH the method attains 84-87% and 34-40% output accuracy respectively over five independent runs, with 29 of 30 paired comparisons against the best static prompt significant at p < 0.05, all while keeping inference latency close to the naive baseline and without fine-tuning.
What carries the argument
AloLab, an iterative system-prompt optimizer that proposes, tests, and refines prompt variations for a target model using only black-box output feedback from benchmark runs.
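The paper does not reproduce AloLab's implementation here; the following is a minimal sketch of the kind of propose-test-refine loop the description implies. The callables `meta_agent_propose` (the meta-agent API call), `target_model` (the black-box target), and `score_fn` (an output-accuracy scorer) are hypothetical stand-ins, not names from the paper.

```python
def optimize_prompt(meta_agent_propose, target_model, score_fn,
                    problems, n_rounds=10):
    """Sketch of a black-box propose-test-refine loop: the meta-agent
    sees only prompts, scores, and failure examples from prior rounds,
    never the target model's weights or logits."""
    best_prompt, best_acc, history = None, -1.0, []
    for _ in range(n_rounds):
        candidate = meta_agent_propose(history)   # next system prompt to try
        results = [
            score_fn(target_model(candidate, p["question"]), p["answer"])
            for p in problems
        ]
        acc = sum(results) / len(results)
        failures = [p for p, ok in zip(problems, results) if not ok][:3]
        history.append({"prompt": candidate, "accuracy": acc,
                        "failures": failures})    # feedback for the next proposal
        if acc > best_acc:
            best_prompt, best_acc = candidate, acc
    return best_prompt, best_acc
```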
If this is right
- Output accuracy, the joint requirement of task correctness and format validity, becomes the operative metric and exposes failures that task accuracy alone hides (made concrete in the sketch after this list).
- Constrained decoding is no longer necessary for format compliance, avoiding its 3.6x to 8.2x latency cost and occasional task-performance degradation.
- The same format non-compliance pattern occurs in both open small models and closed large models such as GPT-4o, where reference prompting yields zero valid outputs.
- Optimization quality scales with meta-agent capability, as shown by the drop in mean accuracy and rise in variance when a weaker meta-agent is substituted.
- No model fine-tuning or internal access is required, so the method applies directly to any model available through an API.
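To make the metric concrete, here is a minimal sketch of the distinction. It assumes an "answer" field as the required JSON key (the abstract says "JSON with required fields" without naming them) and a last-number convention for task accuracy, both assumptions for illustration.

```python
import json
import re

def task_correct(raw: str, expected: str) -> bool:
    """Task accuracy ignores format: credit any response whose last
    number matches the expected answer (one common GSM8K convention)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", raw)
    return bool(nums) and nums[-1] == expected

def output_correct(raw: str, expected: str) -> bool:
    """Output accuracy is the joint event: the response must parse as
    valid JSON AND the required field must hold the correct answer."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return str(parsed.get("answer", "")).strip() == expected

# A response can score under task accuracy yet fail output accuracy,
# which is exactly the gap the paper measures:
resp = "The chickens lay 16 eggs, so the answer is 18"
assert task_correct(resp, "18") and not output_correct(resp, "18")
```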
Where Pith is reading between the lines
- The same black-box iterative search could be applied to other structured-output contracts such as code snippets or tabular data without changing the core procedure.
- Production systems could run periodic re-optimization on fresh data distributions to keep prompts aligned with evolving model behavior or task requirements.
- Lower performance on the harder MATH benchmark relative to GSM8K suggests the method may need longer optimization budgets or richer feedback signals when task complexity increases.
- Combining the discovered prompts with light post-processing or verification steps might push output accuracy still higher while retaining the latency advantage.
Load-bearing premise
The meta-agent can discover system prompts that generalize to new examples rather than overfitting to the specific problems seen during optimization.
What would settle it
Applying the final optimized prompt to a fresh collection of math problems never used in the optimization loop and measuring whether output accuracy falls back to the near-zero levels observed with reference prompts.
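In code, the settling experiment is small: freeze the best discovered prompt and re-score on problems the optimizer never saw, reusing the hypothetical `target_model` and `output_correct` helpers sketched above.

```python
def heldout_output_accuracy(target_model, frozen_prompt,
                            heldout_problems, output_correct):
    """Freeze the optimized prompt and score it on problems that never
    entered the optimization loop. Output accuracy holding near the
    in-loop 84-87% would support generalization; a collapse toward the
    near-zero REFERENCE-prompt level would indicate overfitting."""
    hits = sum(
        output_correct(target_model(frozen_prompt, p["question"]), p["answer"])
        for p in heldout_problems
    )
    return hits / len(heldout_problems)
```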
Original abstract
Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric. A systematic format failure emerges: NAIVE prompting (no system prompt) achieves up to 85% task accuracy on GSM8K but 0% output accuracy across all models and datasets. REFERENCE prompting (a minimal hand-written JSON format prompt) fares little better, yielding 0% output accuracy for two of four models tested. Constrained decoding enforces syntactic validity but incurs 3.6x-8.2x latency overhead and in several settings degrades task performance substantially. To overcome this limitation, we developed AloLab, an iterative system-prompt optimizer (meta-agent: Claude Sonnet 4.5) requiring only black-box API access to the target model; it reaches 84-87% output accuracy on GSM8K and 34-40% on MATH across five independent runs per model, with 29/30 paired McNemar comparisons against the best static prompt significant at p < 0.05, at near-NAIVE inference latency and without model fine-tuning. The same format failure extends to GPT-4o (OpenAI, 2024), a proprietary closed-source model: REFERENCE achieves 0% output accuracy due to systematic markdown-fence wrapping, while AloLab reaches 95.2% [94.8, 95.6]. An ablation replacing the Sonnet 4.5 meta-agent with Claude 3 Haiku reduces mean output accuracy to 61.0% and increases run-to-run standard deviation from <1 pp to 21.8 pp, confirming that meta-agent capability is a primary driver of optimization quality.
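The GPT-4o failure mode named in the abstract, markdown-fence wrapping, is easy to see concretely: a fenced response fails a strict JSON parse even when the payload inside is perfect. The fence-stripping fallback below is a hypothetical mitigation for illustration, not a step the paper takes.

```python
import json
import re

response = '```json\n{"answer": "18"}\n```'     # typical fenced model output

try:
    json.loads(response)                        # strict contract: fails
except json.JSONDecodeError:
    print("rejected: the markdown fence breaks the JSON contract")

# Hypothetical post-processing fallback (not used in the paper):
stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
print(json.loads(stripped))                     # {'answer': '18'}
```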
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the gap between task accuracy and structured output reliability (correct reasoning plus valid JSON) in 7-9B language models on GSM8K and MATH. It shows that NAIVE and REFERENCE prompting yield near-zero output accuracy despite high task accuracy, while constrained decoding adds substantial latency. The authors introduce AloLab, a black-box iterative system-prompt optimizer driven by a meta-agent (Claude Sonnet 4.5), which achieves 84-87% output accuracy on GSM8K and 34-40% on MATH across five runs, with 29/30 McNemar tests significant at p<0.05 versus static baselines, at near-NAIVE latency and no fine-tuning. The same pattern is shown for GPT-4o, and an ablation with a weaker meta-agent (Haiku) degrades performance.
Significance. If the central claims hold, the work is significant for practical deployment of small models in structured-output settings, demonstrating that prompt optimization can close the reliability gap without fine-tuning or latency penalties. Credit is due for the multi-run design, McNemar statistical tests, meta-agent ablation, and extension to a closed-source model. The empirical patterns are consistent across models and datasets, supporting the reported gains under the stated conditions.
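For readers unfamiliar with the paired design: McNemar's test compares two strategies on the same benchmark items and draws its evidence from the discordant pairs only. A minimal sketch using statsmodels, with invented counts for illustration:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Per-problem outcomes for the same items under two prompting strategies.
# Rows: AloLab correct / incorrect; columns: static prompt correct / incorrect.
# These counts are invented for illustration only.
table = np.array([
    [310,  95],   # both correct | only AloLab correct
    [ 12,  83],   # only static correct | both incorrect
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.2e}")
```

With 95 items fixed only by AloLab against 12 fixed only by the static prompt, the exact test rejects marginal homogeneity decisively; the paper reports 29 of 30 such comparisons significant at p < 0.05.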
major comments (2)
- [Abstract / Evaluation] The AloLab optimization is performed via repeated black-box interactions on the GSM8K and MATH test sets themselves, with no mention of a held-out validation split, early stopping on separate data, or post-optimization evaluation on distributionally shifted problems. This makes it unclear whether the discovered prompts capture general format-compliance strategies or exploit benchmark-specific patterns (e.g., common JSON field orderings or error modes), directly affecting the claim of improved structured-output reliability for arbitrary tasks.
- [Results] While the paired McNemar significance tests are reported (29/30 comparisons), the absence of cross-benchmark testing or out-of-distribution evaluation means the headline gains (84-87% GSM8K, 34-40% MATH) may not demonstrate a reliable method beyond these two math datasets; the skeptic's concern about overfitting is not addressed by the current experimental design.
minor comments (2)
- [Abstract] The abstract states 'near-NAIVE inference latency' but does not quantify the exact overhead or number of meta-agent queries per optimization run, which would help assess practicality.
- [Methods] Prompt trajectories and exact stopping criteria for AloLab are not visible, limiting reproducibility even though multiple runs and ablations are provided.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for highlighting important considerations regarding the evaluation methodology. We address each major comment below and propose revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / Evaluation] The AloLab optimization is performed via repeated black-box interactions on the GSM8K and MATH test sets themselves, with no mention of a held-out validation split, early stopping on separate data, or post-optimization evaluation on distributionally shifted problems. This makes it unclear whether the discovered prompts capture general format-compliance strategies or exploit benchmark-specific patterns (e.g., common JSON field orderings or error modes), directly affecting the claim of improved structured-output reliability for arbitrary tasks.
Authors: We agree that the optimization process was performed directly on the GSM8K and MATH test sets, as described in the methods section of the manuscript. This approach was chosen because AloLab is intended as a practical tool for optimizing prompts for a given task using available labeled data, without requiring a separate validation set. The meta-agent evaluates both task correctness and format compliance on each iteration, which encourages discovery of robust strategies rather than dataset-specific artifacts. The low variance across five independent optimization runs (standard deviation <1 pp for Sonnet 4.5) and the significant degradation when using a weaker meta-agent (Haiku) provide evidence against simple overfitting. Nevertheless, we acknowledge the concern and will revise the abstract and evaluation sections to explicitly disclose the use of test data for optimization and to include a discussion of potential limitations regarding generalization to new tasks. We will also add a note on how practitioners could use a held-out set for early stopping in deployment scenarios. revision: yes
- Referee: [Results] While the paired McNemar significance tests are reported (29/30 comparisons), the absence of cross-benchmark testing or out-of-distribution evaluation means the headline gains (84-87% GSM8K, 34-40% MATH) may not demonstrate a reliable method beyond these two math datasets; the skeptic's concern about overfitting is not addressed by the current experimental design.
Authors: We recognize that the current experiments are limited to two mathematical reasoning benchmarks, which serve as a controlled testbed with clear ground truth and strict output requirements. The inclusion of results on GPT-4o extends the findings beyond small open models. However, we agree that cross-benchmark and out-of-distribution evaluations would provide stronger evidence for the general applicability of AloLab. In the revised manuscript, we will expand the discussion section to explicitly address this limitation, including potential risks of benchmark-specific optimization and suggestions for future work involving additional datasets (e.g., non-mathematical tasks or shifted distributions). We maintain that the statistical significance across multiple runs and models supports the reliability of the gains within the evaluated settings. revision: partial
Circularity Check
No circularity in empirical evaluation of prompting strategies
full rationale
The paper is a purely empirical study reporting output accuracy metrics from five prompting strategies (including the iterative AloLab meta-agent optimizer) evaluated on public GSM8K and MATH benchmarks. No equations, derivations, fitted parameters, or first-principles claims appear in the text. Performance numbers are direct experimental measurements with McNemar tests; the method description relies on black-box API interactions without self-referential definitions, load-bearing self-citations, or ansatzes. Any generalization concerns (e.g., overfitting to test distributions) are validity issues, not reductions of the reported results to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Target models respond to system prompts in a manner that permits iterative improvement of output format and correctness through black-box queries.
- domain assumption: Output accuracy (joint mathematical correctness and valid JSON) is the appropriate primary metric for structured-output reliability.
invented entities (1)
- AloLab (no independent evidence)
Reference graph
Works this paper leans on
- [1] Anthropic. Claude 3 model card. Technical report, Anthropic, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
- [2] Anthropic. Claude Sonnet 4.5 system card. Technical report, Anthropic, 2025. URL https://www.anthropic.com/claude-sonnet-4-5-system-card
- [3] K. Cobbe et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [4] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [5] S. Geng et al. JSONSchemaBench: A rigorous benchmark of structured outputs for language models. arXiv preprint arXiv:2501.10868, 2025.
- [6] Google DeepMind. Gemini 3 flash model card. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf
- [7] A. Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [8] S. Hamilton and D. Mimno. Lost in space: Finding the right tokens for structured output. arXiv preprint arXiv:2502.14969, 2025.
- [9] D. Hendrycks et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [10] B. Jin et al. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
- [11] O. Khattab et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations, 2024.
- [12]
- [13] I. Mirzadeh et al. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.
- [14] N. Mündler et al. Constrained decoding of diffusion LLMs with context-free grammars. arXiv preprint arXiv:2508.10111, 2025.
- [15]
- [16] OpenAI. GPT-4o system card. Technical report, OpenAI, 2024. URL https://openai.com/index/gpt-4o-system-card
- [17] K. Park et al. Grammar-aligned decoding. In Advances in Neural Information Processing Systems, 2024.
- [18] R. Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [19] G. Srivastava et al. Towards reasoning ability of small language models. arXiv preprint arXiv:2502.11569, 2025.
- [20] X. Sun et al. Earley-driven dynamic pruning for efficient structured decoding. In International Conference on Machine Learning, 2025.
- [21]
- [22] S. Tenckhoff et al. LLMStructBench: Benchmarking large language model structured data extraction. arXiv preprint arXiv:2602.14743, 2026.
- [23] S. Ugare et al. SynCode: LLM generation with grammar augmentation. arXiv preprint arXiv:2403.01632, 2024.
- [24] Z. Wang et al. Verifiable format control for large language model generations. arXiv preprint arXiv:2502.04498, 2025.
- [25] B. T. Willard and R. Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023.
- [26] A. Yang et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- [27] C. Yang et al. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- [28] Y. Ye et al. In-context reinforcement learning for tool use in large language models. arXiv preprint arXiv:2603.08068, 2026.
- [29] Y. Zhou et al. Large language models are human-level prompt engineers. In International Conference on Learning Representations, 2023.