pith. machine review for the scientific record.

arxiv: 2605.12129 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.AI · cs.OS

Recognition: no theorem link

It's Not the Size: Harness Design Determines Operational Stability in Small Language Models

Yong-eun Cho

Pith reviewed 2026-05-13 04:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI cs.OS
keywords small language models · harness design · task success rate · prompt engineering · operational stability · format violations · SLM performance

The pith

Harness design controls stability and success rates in small language models more than parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three harness levels on 2-3B-parameter models across 24 tasks to measure how wrapper structure affects output reliability. Raw prompts, simple tag wrappers, and a four-stage plan-execute-verify-recover pipeline are compared on task success rate and valid task success rate. The pipeline reaches 0.952 success and perfect valid success on one model, while the simple wrappers sometimes drop below raw prompts and one model loses all JSON structure without harness support. The work shows that interface engineering can make tiny models operationally usable without scaling up their size.

Core claim

Operational stability in small language models depends on harness level rather than on size alone. Model-only prompts produce scaffold collapse in LLaMA 3.2 3B, with seven format violations and a TSR of 0.429. Minimal-shell wrappers yield non-monotonic drops below model-only in two of the three models. The 4-stage pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B across 21 tasks, with ablation attributing roughly 24.7 percent of the total gain to the planning stage and another 24.7 percent to the recovery stage, plus a verification catch rate of 0.625.
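The metrics are never formally defined in the excerpt (the referee flags this below). The sketch that follows is one plausible reading that reproduces the headline numbers; the Run fields and the VTSR denominator are assumptions, not the paper's definitions.

```python
# One plausible reading of the metrics, sketched for concreteness; the
# excerpt gives no formal definitions, so the Run fields and the VTSR
# denominator below are assumptions, not the paper's specification.
from dataclasses import dataclass

@dataclass
class Run:
    format_valid: bool  # output parsed as the required structure (e.g. JSON)
    success: bool       # task-level success criterion met

def tsr(runs):
    """Task Success Rate: successes over all attempted tasks."""
    return sum(r.success for r in runs) / len(runs)

def vtsr(runs):
    """Valid TSR: success rate restricted to format-valid outputs."""
    valid = [r for r in runs if r.format_valid]
    return sum(r.success for r in valid) / len(valid) if valid else 0.0

# Consistent with the reported Gemma4 E2B pipeline numbers (21 tasks,
# TSR=0.952, VTSR=1.000): 20 valid successes plus 1 invalid failure.
runs = [Run(True, True)] * 20 + [Run(False, False)]
assert round(tsr(runs), 3) == 0.952 and vtsr(runs) == 1.0
```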

What carries the argument

Three harness conditions (model-only raw prompts, minimal-shell wrapper tags, and a 4-stage plan-execute-verify-recover pipeline) that structure model interaction and enforce output format.
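A minimal sketch of what those three levels could look like, assuming a plain text-in/text-out llm callable; the tag format, stage prompts, and retry budget are illustrative guesses rather than the paper's implementation.

```python
# Illustrative sketch of the three harness levels, assuming a plain
# text-in/text-out callable `llm`; tag format, stage prompts, and the
# retry budget are guesses for concreteness, not the paper's harness.

def extract_between(text, start, end):
    """Pull the span between two tags, falling back to the raw text."""
    i = text.find(start)
    j = text.find(end, i + len(start)) if i != -1 else -1
    return text[i + len(start):j] if j != -1 else text

def model_only(llm, task):
    # Level 1: raw prompt, output used as-is.
    return llm(task)

def minimal_shell(llm, task):
    # Level 2: wrapper tags marking where the answer must appear.
    out = llm(f"<task>{task}</task>\nRespond inside <answer></answer> tags.")
    return extract_between(out, "<answer>", "</answer>")

def pipeline(llm, task, validate, max_retries=2):
    # Level 3: plan -> execute -> verify -> recover.
    plan = llm(f"Plan the steps needed for this task:\n{task}")    # plan
    out = llm(f"Task: {task}\nPlan: {plan}\nExecute and answer.")  # execute
    for _ in range(max_retries):
        ok, issue = validate(out)                                  # verify
        if ok:
            break
        out = llm(f"Your output failed validation ({issue}).\n"    # recover
                  f"Task: {task}\nProduce a corrected output.")
    return out
```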

Load-bearing premise

The 24 tasks and three chosen 2-3B models are representative enough that the observed harness effects and non-monotonic behavior will appear in other small models and applications.

What would settle it

Run the same three harness conditions on a fourth 2-3B model or on twenty new tasks. If the pipeline no longer outperforms model-only, or minimal-shell no longer underperforms it, the claim that harness design determines stability is falsified.
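In code, that test is a short loop, reusing the harness sketches above; succeeded stands in for the per-task success criterion, which the excerpt does not specify.

```python
# Sketch of the replication test, reusing the harness sketches above.
# `succeeded(output, task)` stands in for the per-task success criterion,
# which the excerpt does not specify.

def replication_check(llm, tasks, validate, succeeded):
    conditions = {
        "model-only": lambda t: model_only(llm, t),
        "minimal-shell": lambda t: minimal_shell(llm, t),
        "pipeline": lambda t: pipeline(llm, t, validate),
    }
    tsr = {name: sum(succeeded(run(t), t) for t in tasks) / len(tasks)
           for name, run in conditions.items()}
    # The claim predicts both orderings; either one failing to reappear
    # on fresh models or tasks is the falsification described above.
    pipeline_gain_holds = tsr["pipeline"] > tsr["model-only"]
    nonmonotonic_dip_holds = tsr["minimal-shell"] < tsr["model-only"]
    return tsr, pipeline_gain_holds, nonmonotonic_dip_holds
```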

Figures

Figures reproduced from arXiv: 2605.12129 by Yong-eun Cho.

Figure 1
Figure 1. TSR/VTSR comparison by harness condition (Gemma4 E2B, T1–T5). view at source ↗
Figure 2
Figure 2. Category-level TSR comparison (Gemma4 E2B, T1–T5). T2 and T3, structurally simple, reach TSR=1.00 under all conditions. T4 (Workflow Completion) shows the largest pipeline gain: pipeline TSR=1.00 vs. model-only 0.50 (+0.50) and minimal-shell 0.25 (+0.75); pipeline planning pre-structures the execution flow while recovery handles timeout retries. view at source ↗
Figure 4
Figure 4. Cross-model TSR comparison (T1–T5). Gemma4 E2B shows the largest pipeline gain (+0.190). Qwen3.5:2B pipeline TSR (0.857) falls below model-only (0.952): verify-recover induces repeated score=1 in T2. LLaMA 3.2 3B model-only (TSR=0.429) fails the threshold without a harness. view at source ↗
Figure 3
Figure 3. Ablation stage TSR (Gemma4 E2B, T1–T6): model-only (0.667) → full pipeline (0.833), a total gain of +0.166; the pipeline-no-verify TSR=0.909 reversal is a T6 rate-limit artifact (§6.3). view at source ↗

Table XI. Pipeline component contribution analysis

  Removed stage      Resulting TSR   TSR drop vs. full    % of total gain
  Planning           0.792           −0.041               24.7%
  Verify + Recover   0.909           −0.076 (reversed†)   N/A
  Recover            0.792           −0.041               24.7%

  † pipeline-no-verify reversal is a T6 rate-limit artifact (§6.3).
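The 24.7% attributions in Table XI follow from simple arithmetic on the reported TSRs: a stage's share of the total gain is the TSR drop when that stage is removed, divided by the full model-only-to-pipeline gain.

```python
# Worked arithmetic behind Table XI's "% of total gain" column.
total_gain = 0.833 - 0.667      # full pipeline minus model-only: 0.166
planning_drop = 0.833 - 0.792   # TSR drop when planning is removed: 0.041
print(f"{planning_drop / total_gain:.1%}")  # 24.7%, matching the table
```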
read the original abstract

This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain. VCR (Verification Catch Rate)=0.625 across all pipeline runs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript experimentally investigates the impact of harness design on the operational stability of small language models (2-3B parameters). It evaluates three conditions (model-only prompting, minimal-shell wrappers, and a four-stage plan-execute-verify-recover pipeline) across three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) on 24 tasks, using Task Success Rate (TSR) and Valid Task Success Rate (VTSR) as metrics. Key findings include the pipeline achieving TSR=0.952 and VTSR=1.000 on Gemma4 E2B over 21 tasks, non-monotonic performance where minimal-shell underperforms model-only in two models, scaffold collapse in LLaMA model-only (TSR=0.429) due to format violations, and ablations showing planning and recovery each contributing about 24.7% of the performance gain, with VCR=0.625.

Significance. If the results generalize, this work underscores that for SLMs, careful harness engineering can dramatically improve reliability in task execution, particularly in maintaining output formats and recovering from errors, which may be more critical than scaling model size. The non-monotonic effects and specific failure modes like scaffold collapse offer practical insights for deploying small models in applications requiring structured outputs.

major comments (3)
  1. [Abstract and Results] The headline performance claims (TSR=0.952 and VTSR=1.000 on Gemma4 E2B, T1-T5, 21 tasks) and the ablation attributing ~24.7% of the gain each to the planning and recovery stages rest on a narrow experimental base of three models and 24 tasks, with headline results reported on a 21-task subset. Task selection criteria, diversity, and full protocol details are not provided, which undermines the claim that harness design determines stability across SLMs rather than reflecting task- or model-specific artifacts.
  2. [Results] No statistical tests, error bars, or variance measures are reported for the TSR/VTSR values or for the non-monotonic reversals (minimal-shell TSR < model-only TSR in two models). It is therefore impossible to assess whether the observed differences (e.g., LLaMA model-only TSR=0.429 with seven format violations) are reliable or could arise from sampling variability; a bootstrap sketch of such a check follows this list.
  3. [Results on LLaMA] The scaffold collapse phenomenon in LLaMA 3.2 3B model-only is presented as evidence of harness necessity, but without additional controls (e.g., varying prompt complexity or format requirements across models), it is unclear whether this is a general harness effect or tied to the specific 24-task distribution and prompt templates used.
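On major comment 2, even deterministic single runs permit a resampling check across tasks. A minimal bootstrap sketch for a TSR difference between two conditions, with illustrative rather than reported per-task outcomes:

```python
# Minimal bootstrap sketch for the missing variance check: a confidence
# interval on a TSR difference between two harness conditions. The per-task
# outcome vectors below are illustrative placeholders, not reported data.
import random

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """CI for mean(a) - mean(b); a and b are per-task 0/1 outcomes."""
    rng = random.Random(seed)
    diffs = sorted(
        sum(rng.choice(a) for _ in a) / len(a)
        - sum(rng.choice(b) for _ in b) / len(b)
        for _ in range(n_boot)
    )
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# e.g. 21 tasks: pipeline 20/21 successes vs. model-only 16/21 (illustrative)
print(bootstrap_ci([1] * 20 + [0], [1] * 16 + [0] * 5))
```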
minor comments (2)
  1. [Abstract] The metrics TSR, VTSR, and VCR (Verification Catch Rate=0.625) are introduced without explicit definitions or formulas in the abstract; these should be defined at first use with examples for clarity.
  2. [Results] The paper would benefit from a table summarizing all TSR/VTSR results across the three models and harness conditions to allow direct comparison of the non-monotonic effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with point-by-point responses. Revisions have been made to the manuscript where the concerns identify gaps in reporting or clarity.

read point-by-point responses
  1. Referee: [Abstract and Results] The headline performance claims (TSR=0.952 and VTSR=1.000 on Gemma4 E2B, T1-T5, 21 tasks) and the ablation attributing ~24.7% of the gain each to the planning and recovery stages rest on a narrow experimental base of three models and 24 tasks, with headline results reported on a 21-task subset. Task selection criteria, diversity, and full protocol details are not provided, which undermines the claim that harness design determines stability across SLMs rather than reflecting task- or model-specific artifacts.

    Authors: We acknowledge the limited scope of three models and 24 tasks. The tasks were drawn from standard benchmarks to span reasoning, coding, knowledge, and instruction-following categories; we have added an explicit subsection in the revised Methods describing selection criteria, category distribution, and the complete evaluation protocol (including prompt templates and success criteria). While broader validation across more models and tasks would be desirable, the replication of key patterns (non-monotonic effects, format collapse, and stage contributions) across three architecturally distinct models supports the central claim that harness design is a primary determinant of operational stability rather than an artifact of any single task set. revision: yes

  2. Referee: [Results] No statistical tests, error bars, or variance measures are reported for the TSR/VTSR values or for the non-monotonic reversals (minimal-shell TSR < model-only TSR in two models). It is therefore impossible to assess whether the observed differences (e.g., LLaMA model-only TSR=0.429 with seven format violations) are reliable or could arise from sampling variability.

    Authors: The referee correctly notes the absence of statistical measures. Our original evaluation used deterministic decoding and single runs per condition, so variance statistics were not computed. In the revision we have added a limitations paragraph in Results discussing this deterministic setup and the potential for sampling variability; where multiple prompt phrasings were internally tested we now report the observed range. We agree this is a methodological limitation and have updated the text accordingly. revision: partial

  3. Referee: [Results on LLaMA] The scaffold collapse phenomenon in LLaMA 3.2 3B model-only is presented as evidence of harness necessity, but without additional controls (e.g., varying prompt complexity or format requirements across models), it is unclear whether this is a general harness effect or tied to the specific 24-task distribution and prompt templates used.

    Authors: The scaffold collapse is shown by within-model, within-task comparison: the identical prompts and format requirements produce seven format violations (TSR=0.429) in the model-only condition but are resolved under the harness conditions. This design isolates the contribution of the harness. We agree that explicit controls varying prompt complexity would further generalize the finding; we have added a paragraph in the Discussion acknowledging this and outlining how future experiments could systematically vary format demands. The current results still demonstrate that, for the complex structured-output requirements in our task set, harness support is required to maintain operational stability. revision: partial

Circularity Check

0 steps flagged

No circularity: purely experimental comparisons with no derivations or fitted parameters

full rationale

The paper reports direct measurements of Task Success Rate (TSR) and Valid TSR (VTSR) under three harness conditions (model-only, minimal-shell, pipeline) applied to three specific SLMs across 24 tasks. No equations, derivations, parameter fitting, or self-citations appear in the provided text or abstract. All claims, including the non-monotonic phenomenon, scaffold collapse, and ablation contributions, rest on observed experimental outcomes rather than on conclusions that reduce to their own inputs by construction. With no derivation chain to audit, none of the patterns that would trigger circularity flags can arise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical experimental study with no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5511 in / 1131 out tokens · 68966 ms · 2026-05-13T04:20:18.546691+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning," arXiv:2303.11366, 2023

  2. [2]

    ReAct: Synergizing Reasoning and Acting in Language Models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing Reasoning and Acting in Language Models," ICLR 2023

  3. [3]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, W. Chen, and M. Zhou, "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing," arXiv:2305.11738, 2023

  4. [4]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    M. Abdin et al., "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone," arXiv:2404.14219, 2024

  5. [5]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Google DeepMind, "Gemma: Open Models Based on Gemini Research and Technology," arXiv:2403.08295, 2024

  6. [6]

    Mistral 7B

    A. Q. Jiang et al., "Mistral 7B," arXiv:2310.06825, 2023

  7. [7]

    Prompting Is Programming: A Query Language for Large Language Models

    L. Beurer-Kellner, M. Fischer, and M. Vechev, "Prompting Is Programming: A Query Language for Large Language Models," PLDI 2023

  8. [8]

    Guidance: A guidance language for controlling large language models

    Microsoft Guidance Team, "Guidance: A guidance language for controlling large language models," GitHub, 2023

  9. [9]

    Qwen2.5 Technical Report

    Qwen Team, Alibaba Cloud, "Qwen2.5 Technical Report," arXiv:2412.15115, 2024

  10. [10]

    Ollama: Get up and running with large language models locally

    Ollama Team, "Ollama: Get up and running with large language models locally," https://github.com/ollama/ollama, 2023

  11. [11]

    Efficient Guided Generation for Large Language Models

    R. Willard and J. Louf, "Efficient Guided Generation for Large Language Models," arXiv:2307.09702, 2023