Procedural-Skill SFT Across Capacity Tiers: A W-Shaped Pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B–4B Qwen3.5 Models
Pith reviewed 2026-05-13 05:40 UTC · model grok-4.3
The pith
SFT provides roughly uniform procedural skill gains across 0.8B to 4B models, but pre-SFT performance follows a W-shaped pattern that shapes final outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under matched-path LLM-only scoring, the SFT-attributable procedural-Δ lift is roughly uniform across sizes: +0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B. Variation in post-SFT Δ (−0.005, +0.100, +0.065) is dominated by a W-shaped pre-SFT base trajectory (−0.075, +0.060, −0.010, Haiku 4.5 at +0.030): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure.
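The decomposition is easy to verify by hand: each post-SFT Δ should equal the pre-SFT Δ plus the SFT-attributable lift. A minimal sketch checking the arithmetic, using only the values quoted above:

```python
# All values copied from the abstract; nothing here is new data.
pre_sft_delta  = {"0.8B": -0.075, "2B": 0.060, "4B": -0.010}
sft_lift       = {"0.8B":  0.070, "2B": 0.040, "4B":  0.075}
post_sft_delta = {"0.8B": -0.005, "2B": 0.100, "4B":  0.065}

for size in pre_sft_delta:
    # post-SFT delta = pre-SFT delta + SFT-attributable lift
    reconstructed = pre_sft_delta[size] + sft_lift[size]
    assert abs(reconstructed - post_sft_delta[size]) < 1e-9, size
    print(f"{size}: pre {pre_sft_delta[size]:+.3f}, lift {sft_lift[size]:+.3f}, post {reconstructed:+.3f}")
```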
What carries the argument
The W-shaped pre-SFT base trajectory across capacity tiers that creates a regime-asymmetric compensation pattern under SFT.
Load-bearing premise
The 200-task holdout and LLM judge-pass metric measure true procedural skill without residual bias from format-compliance artifacts or single-seed variability.
What would settle it
Measuring the same procedural-Δ lifts and pre-SFT trajectory on 8B and 14B models would confirm or refute the claim of uniform SFT gains paired with continuing W-shaped base variation.
Original abstract
We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding: under matched-path LLM-only scoring, the SFT-attributable procedural-Δ lift is roughly uniform across sizes: +0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B. Variation in post-SFT Δ (−0.005, +0.100, +0.065) is dominated by a W-shaped pre-SFT base trajectory (−0.075, +0.060, −0.010, Haiku 4.5 at +0.030): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure, a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology: (i) a bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against CU. (ii) A negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT CU pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family validation: GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's κ ≥ 0.754, agreement ≥ 93.25%. Earlier "format-only at 0.8B" and "shrinking SFT at 4B" framings were path-mismatch artifacts; this paper supersedes both (see the path-mismatch appendix). Single-seed; threats discussed in the threats-to-validity section.
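To make the format-compliance artifact concrete: a deterministic ANSWER-line scorer fails any completion lacking the literal marker, however correct its free-form conclusion. A hypothetical sketch of that failure mode, assuming a regex-style extractor (the benchmark's real extraction code is not reproduced in this review):

```python
import re

# Assumed shape of a deterministic ANSWER-line extractor; illustrative only.
ANSWER_RE = re.compile(r"^ANSWER:\s*(.+?)\s*$", re.MULTILINE)

def extract_answer(completion):
    """Return the ANSWER-line payload, or None for free-form conclusions."""
    m = ANSWER_RE.search(completion)
    return m.group(1) if m else None

print(extract_answer("Step 5 gives 42.\nANSWER: 42"))    # '42' -> gradable
print(extract_answer("So the final value must be 42."))  # None -> scored as a fail
```

An LLM-only judge reads the whole completion instead, which is presumably why re-judging shifts outcomes most for completions that are correct but off-format.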
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically measures the contribution of procedural-skill supervised fine-tuning (SFT) across Qwen3.5 dense models at 0.8B, 2B, and 4B scales on a 200-task/40-skill holdout set. Under matched-path LLM-only scoring with Claude Haiku 4.5 (and GPT-5.4 cross-validation), it reports roughly uniform SFT-attributable procedural-Δ lifts of +0.070 / +0.040 / +0.075, with post-SFT variation dominated by a W-shaped pre-SFT base trajectory (-0.075 / +0.060 / -0.010). This supports a regime-asymmetric mechanism in which SFT provides the largest absolute benefit where the base model struggles with the 5-step procedure, along with a falsifiable prediction for 8B/14B scales. The work also documents and mitigates an 83.5% format-compliance artifact from a deterministic extractor and supersedes prior path-mismatch framings.
Significance. If the reported deltas and W-shaped trajectory prove robust, the results would provide a capacity-tiered view of SFT effects on procedural skills, showing that SFT compensates most where base performance is weakest rather than exhibiting uniform shrinkage or format-only gains. The cross-family validation (Cohen's κ ≥ 0.754, ≥93.25% agreement across 2800 episodes) and explicit artifact analysis are strengths that improve upon earlier mismatched-path comparisons. The falsifiable prediction at larger scales adds testability to the regime-asymmetric claim.
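For readers who want to sanity-check the agreement statistics: Cohen's κ for two binary judges is observed agreement corrected for the agreement expected from the judges' marginal pass rates. A self-contained sketch with toy verdicts (the actual 2800-episode verdict lists are not included in this review):

```python
# judge_a, judge_b: paired 0/1 judge-pass verdicts in episode order.
def agreement_and_kappa(judge_a, judge_b):
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n  # raw agreement
    pa, pb = sum(judge_a) / n, sum(judge_b) / n              # marginal pass rates
    p_e = pa * pb + (1 - pa) * (1 - pb)                      # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)                      # (agreement, kappa)

# Toy check: 9 of 10 paired episodes agree.
acc, kappa = agreement_and_kappa([1,1,0,0,1,0,1,1,0,1], [1,1,0,0,1,0,1,1,0,0])
print(f"agreement={acc:.2%}, kappa={kappa:.3f}")  # agreement=90.00%, kappa=0.800
```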
major comments (2)
- [Threats section and experimental setup] The central claims rest on small numerical deltas (SFT lifts of +0.070/+0.040/+0.075 and pre-SFT W-trajectory values of −0.075/+0.060/−0.010) obtained from single-seed training and single-run evaluation on the 200-task holdout. No standard errors, confidence intervals, or multi-seed statistics are provided, so it is impossible to determine whether the observed uniformity of lifts or the dominance of the pre-SFT trajectory exceeds run-to-run variance in training stochasticity and LLM-judge noise. This is load-bearing for the regime-asymmetric mechanism and the 8B/14B prediction (a sketch of the requested interval appears after this list).
- [Methodology, format-compliance artifact] Although the 83.5% deterministic ANSWER-line extractor bias is identified and addressed via LLM-only re-judge plus GPT-5.4 cross-validation, the manuscript does not report a sensitivity analysis quantifying how much the original bias (or the mitigation) shifts the per-size procedural-Δ values. Without this, residual bias cannot be ruled out as a contributor to the reported uniformity or W-shape.
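A minimal sketch of the interval the first comment asks for, assuming per-task judge verdicts were retained in paired form (all names hypothetical; a paired nonparametric bootstrap over holdout tasks):

```python
import random

# passes_with / passes_without: per-task 0/1 verdicts on the 200-task holdout,
# with and without the procedural-skill block. Hypothetical inputs.
def bootstrap_delta_ci(passes_with, passes_without, n_boot=10_000, alpha=0.05):
    n = len(passes_with)
    deltas = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]  # resample tasks with replacement
        deltas.append(sum(passes_with[i] - passes_without[i] for i in idx) / n)
    deltas.sort()
    return deltas[int(n_boot * alpha / 2)], deltas[int(n_boot * (1 - alpha / 2)) - 1]
```

Note that such an interval bounds evaluation-set and judge noise only; as the rebuttal concedes, seed-to-seed training variance would still require new SFT runs.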
minor comments (2)
- [Abstract and §1] The abbreviation CU (presumably 'correct under judge') and the exact definition of 'procedural-Δ' should each be introduced with an explicit equation or formula on first use in the main text for reader clarity (one plausible reading of Δ is sketched after this list).
- [Results section] Table or figure presenting the per-size deltas would benefit from an additional column or row showing the raw pre-SFT and post-SFT pass rates alongside the Δ values to allow direct verification of the W-shape arithmetic.
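On the first minor comment: the review can only infer the definition from context. A hedged sketch of one plausible reading (not an equation the paper states):

```python
# Assumed reading of procedural-delta, inferred from the abstract: judge-pass
# rate on tasks with the 5-step procedural-skill block prepended, minus the
# rate on the same tasks without it.
def procedural_delta(pass_rate_with_block, pass_rate_without_block):
    return pass_rate_with_block - pass_rate_without_block

# Under this reading, pre-SFT delta = -0.075 at 0.8B says the skill block
# costs the base model 7.5 points of judge-pass rate on the holdout.
print(round(procedural_delta(0.435, 0.510), 3))  # hypothetical rates -> -0.075
```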
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We appreciate the emphasis on statistical robustness of the reported deltas and the need for sensitivity analysis on the format artifact. We address each major comment below and commit to revisions where feasible with existing data.
Point-by-point responses
- Referee: [Threats section and experimental setup] The central claims rest on small numerical deltas (SFT lifts of +0.070/+0.040/+0.075 and pre-SFT W-trajectory values of −0.075/+0.060/−0.010) obtained from single-seed training and single-run evaluation on the 200-task holdout. No standard errors, confidence intervals, or multi-seed statistics are provided, so it is impossible to determine whether the observed uniformity of lifts or the dominance of the pre-SFT trajectory exceeds run-to-run variance in training stochasticity and LLM-judge noise. This is load-bearing for the regime-asymmetric mechanism and the 8B/14B prediction.
Authors: We acknowledge that all results are from single-seed training and single-run evaluation, as stated in the manuscript. The threats section already identifies training stochasticity and LLM-judge noise as limitations. The GPT-5.4 cross-validation (κ ≥ 0.754 across 2800 episodes) shows consistent directionality for every per-size finding, offering some robustness against judge noise. However, we lack multi-seed runs and cannot compute standard errors or confidence intervals without new experiments. We will expand the threats section to discuss run-to-run variance more explicitly and qualify the uniformity claim and 8B/14B prediction with this caveat. Revision: partial.
- Referee: [Methodology, format-compliance artifact] Although the 83.5% deterministic ANSWER-line extractor bias is identified and addressed via LLM-only re-judge plus GPT-5.4 cross-validation, the manuscript does not report a sensitivity analysis quantifying how much the original bias (or the mitigation) shifts the per-size procedural-Δ values. Without this, residual bias cannot be ruled out as a contributor to the reported uniformity or W-shape.
Authors: We agree a quantitative sensitivity analysis is needed. While the LLM-only re-judge and GPT-5.4 validation mitigate the 83.5% extractor bias, we did not report how much the original versus mitigated scoring shifts the per-size SFT lifts or pre-SFT W-trajectory. In revision we will add this by recomputing the procedural-Δ values under both methods for each scale and tabulating the differences, directly testing whether residual bias could drive the uniformity or W-shape (a sketch of this recomputation follows this list). Revision: yes.
- Not addressed: the request for standard errors, confidence intervals, and multi-seed statistics to assess whether the deltas exceed run-to-run variance, as these require new training runs not present in the current single-seed experiments.
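A minimal sketch of the sensitivity recomputation promised in the second response, assuming both verdicts were logged per episode (all field names hypothetical):

```python
# episodes: per-task records carrying both verdicts (deterministic ANSWER-line
# extractor vs LLM-only re-judge) and a flag for the procedural-skill block.
def delta_under(episodes, verdict_key):
    with_block = [e for e in episodes if e["has_skill_block"]]
    without = [e for e in episodes if not e["has_skill_block"]]

    def rate(rows):
        return sum(r[verdict_key] for r in rows) / len(rows)

    return rate(with_block) - rate(without)

def sensitivity_row(episodes):
    d_ext = delta_under(episodes, "extractor_pass")  # original, biased path
    d_llm = delta_under(episodes, "llm_judge_pass")  # mitigated path
    return {"delta_extractor": d_ext, "delta_llm": d_llm, "shift": d_llm - d_ext}
```

Tabulating `sensitivity_row` per model size and checkpoint would show directly whether the W-shape survives under both scoring paths.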
Circularity Check
No significant circularity
Full rationale
The paper consists entirely of direct empirical measurements: procedural-Δ lifts are computed from pass-rate differences on a 200-task holdout using LLM judges (Claude Haiku 4.5 and GPT-5.4 cross-validation). No equations, fitted parameters, ansatzes, or derivations are presented; the W-shaped trajectory and regime-asymmetric claim are simply descriptive summaries of the observed deltas. No self-citations are invoked as load-bearing uniqueness theorems or to justify any modeling choice. The single-seed limitation and format-compliance artifact are acknowledged as threats to validity but do not create circular reduction of any result to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM judges (Claude Haiku 4.5 and GPT-5.4) provide reliable procedural-skill scores that match human judgment.
Reference graph
Works this paper leans on
- [1] D. Yu, S. Kaur, A. Gupta, J. Brown-Cohen, A. Goyal, S. Arora. Skill-Mix: A Flexible and Expandable Family of Evaluations for AI Models. NeurIPS 2024.
- [2]
- [3] Li et al. SkillsBench: Benchmarking the Effectiveness of Skill Injection on LLMs. 2026.
- [4] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
- [5] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
- [6] L. Tunstall, E. Beeching, et al. TRL: Transformer Reinforcement Learning library, version ≥ 0.18.
- [7] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks 2023.
- [8] S. Biderman, H. Schoelkopf, et al. Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv:2405.14782, 2024.
- [9] A. Yang, A. Li, B. Yang, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025.
- [10] Qwen Team. Qwen3.5: Accelerating Productivity with Native Multimodal Agents. Release blog and model cards, 2026. https://huggingface.co/Qwen/Qwen3.5-4B-Base.