Disposition Distillation at Small Scale: A Three-Arc Negative Result
Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3
The pith
No tested method adds self-verification or uncertainty acknowledgment to small language models without harming content or reducing to stylistic mimicry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across SFT/DPO LoRA fine-tuning on three model families, inference-time attention-head tempering on the o_proj matrix, and a training-free frozen-base sidecar reading the final-token hidden state h_last, no operator moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct, and within-distribution probes collapse to chance on fresh prompts.
What carries the argument
The three-arc negative-result pipeline that systematically tests LoRA alignment, attention-head tempering, and linear h_last probes after an initial result was falsified by truncation and scoring artifacts.
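The paper's exact tempering operator on o_proj is not reproduced here; a minimal numpy sketch of one plausible form — uniformly rescaling the columns of the output projection that carry a chosen head — looks like this (the toy dimensions and the name `temper_head` are illustrative assumptions, not the authors' code):

```python
import numpy as np

def temper_head(w_o: np.ndarray, head: int, head_dim: int, alpha: float) -> np.ndarray:
    """Scale one attention head's contribution to the residual stream.

    In a standard transformer block, o_proj maps the concatenation of all
    head outputs back to d_model, so columns [head*head_dim, (head+1)*head_dim)
    of w_o carry exactly one head. alpha < 1 dampens that head at inference
    time; alpha > 1 amplifies it. No weights are trained.
    """
    w = w_o.copy()
    lo, hi = head * head_dim, (head + 1) * head_dim
    w[:, lo:hi] *= alpha
    return w

# Toy check: tempering head 1 with alpha=0.5 changes the block output by
# exactly half of that head's original contribution.
rng = np.random.default_rng(0)
d_model, n_heads, head_dim = 8, 2, 4
w_o = rng.normal(size=(d_model, n_heads * head_dim))
heads = rng.normal(size=n_heads * head_dim)        # concatenated head outputs
baseline = w_o @ heads
tempered = temper_head(w_o, head=1, head_dim=head_dim, alpha=0.5) @ heads
```

Because the intervention is a fixed rescaling of existing weights, it can only reweight directions the model already writes to the residual stream, which is consistent with the reported outcome that it shifts style without adding a new disposition.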
If this is right
- Standard alignment recipes such as SFT and DPO with LoRA do not transfer behavioral dispositions at the tested scales.
- Linear probes on the final hidden state fail to generalize disposition prediction beyond the training distribution.
- Initial positive distillation results must be subjected to apples-to-apples re-scoring and longer generation lengths to rule out truncation artifacts.
- At least one small model (Gemma 4 E2B on the Chef domain) exhibits near-complete confidence-correctness decoupling.
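The reported probe collapse (AUC 0.683 within distribution falling to 0.516 on fresh prompts) is the signature of a linear probe latching onto a prompt-set-specific direction rather than the disposition itself. A self-contained numpy sketch with synthetic h_last vectors — the feature geometry below is invented for illustration, not taken from the paper — reproduces the shape of that failure:

```python
import numpy as np

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Mann-Whitney AUC: P(score_pos > score_neg), ties counted half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def make_hlast(n, dim, rng, spurious):
    """Synthetic final-token states: dim 0 carries a weak genuine
    disposition signal; dim 1 carries a strong artifact that exists
    only in the training prompt distribution (spurious=True)."""
    y = rng.integers(0, 2, n)
    x = rng.normal(size=(n, dim))
    x[:, 0] += 0.2 * y                      # weak true signal
    if spurious:
        x[:, 1] += 0.6 * y                  # prompt-distribution artifact
    return x, y

rng = np.random.default_rng(0)
xtr, ytr = make_hlast(4000, 16, rng, spurious=True)
w = xtr[ytr == 1].mean(0) - xtr[ytr == 0].mean(0)    # mean-difference probe

xin, yin = make_hlast(4000, 16, rng, spurious=True)   # same distribution
xfr, yfr = make_hlast(4000, 16, rng, spurious=False)  # fresh prompts
auc_in, auc_fresh = auc(xin @ w, yin), auc(xfr @ w, yfr)
```

With this setup the probe scores well above chance in distribution and collapses toward 0.5 once the artifact direction stops correlating with the label, even though the weak genuine signal is still present. A non-linear or multi-layer probe would not escape this failure mode if it, too, is fit on the correlated distribution.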
Where Pith is reading between the lines
- The observed failure modes may indicate a capacity limit in small models for representing dispositions separately from surface style.
- Non-linear or multi-layer probes could be tested to determine whether the linear h_last approach itself is the limiting factor.
- The honest falsification pipeline used here offers a template for converting self-discovered false positives into publishable negative findings.
Load-bearing premise
Judge-measured disposition scores reliably capture the intended behavioral traits independently of content quality and stylistic mimicry.
What would settle it
A concrete demonstration of any method, whether training-based or inference-based, that raises disposition scores on fresh prompts while preserving content quality and avoiding repetitive stylistic patterns.
Figures
Original abstract
We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).
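The n_predict inversion described in the abstract is a generic hazard whenever clipped generations are scored as failures: a cap that truncates one model's completions more often than another's manufactures a delta. A stdlib-only sketch makes the mechanism concrete (the completion lengths and correctness flags below are invented for illustration, not the paper's data):

```python
def pass_rate(completions, cap):
    """Score a completion as passing only if it is functionally correct
    AND fits within the generation cap; clipped generations always fail."""
    n_pass = sum(c["correct"] and c["tokens"] <= cap for c in completions)
    return n_pass / len(completions)

# Hypothetical runs: the distilled student writes short solutions, the
# baseline writes longer solutions that are more often correct.
student  = [{"tokens": 300, "correct": True}] * 5 + [{"tokens": 300, "correct": False}] * 5
baseline = [{"tokens": 800, "correct": True}] * 7 + [{"tokens": 800, "correct": False}] * 3

delta_512  = pass_rate(student, 512)  - pass_rate(baseline, 512)   # baseline fully clipped
delta_1024 = pass_rate(student, 1024) - pass_rate(baseline, 1024)  # true gap emerges
```

At a 512-token cap every baseline completion is truncated and the student appears +50 points ahead; at 1024 tokens the sign flips to the true -20-point gap. This is the same class of artifact the authors caught in their own HumanEval numbers.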
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a three-arc negative result on distilling behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small LMs (0.6B–2.3B parameters). Initial positive findings on Qwen3-0.6B were falsified as artifacts of n_predict=512 truncation and non-apples-to-apples scoring; subsequent experiments with SFT/DPO LoRA on three families/two domains, o_proj attention-head tempering, and a frozen-base h_last sidecar find no operator that improves judge-measured disposition without content damage or stylistic mimicry. Results are consistent across five models; the h_last probe shows within-distribution AUC 0.683 collapsing to 0.516 on fresh prompts. Additional finding: Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain.
Significance. If the disposition judge isolates the target traits, the work supplies a credible negative result at small scale, a falsification pipeline that converts self-produced false positives into publishable negatives, and a two-failure-mode taxonomy for linear h_last probes. These elements are valuable for the field even if the absolute claim is scoped to the tested operators and judge.
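The paper's exact definition of assertion asymmetry is not reproduced in this review; a natural reading, consistent with the reported value of -0.009 at a 91% assertion rate, is the gap between how often the model asserts when it is correct and when it is incorrect, which is near zero for a fully decoupled model. A minimal sketch under that assumption (the metric name and toy records are illustrative):

```python
def assertion_asymmetry(records):
    """P(assertive | correct) - P(assertive | incorrect).

    Near zero means confidence is decoupled from correctness: the model
    asserts at the same rate whether or not it is right.
    """
    def rate(group):
        return sum(r["assertive"] for r in group) / len(group)
    correct = [r for r in records if r["correct"]]
    incorrect = [r for r in records if not r["correct"]]
    return rate(correct) - rate(incorrect)

# Toy records mimicking the reported pattern: ~91% assertion rate
# regardless of correctness (counts invented for illustration).
records = (
      [{"correct": True,  "assertive": True}]  * 91
    + [{"correct": True,  "assertive": False}] * 9
    + [{"correct": False, "assertive": True}]  * 91
    + [{"correct": False, "assertive": False}] * 9
)
gap = assertion_asymmetry(records)   # 0.0 for this toy data
```

A well-calibrated model would show a strongly positive gap; a value indistinguishable from zero, as reported for Gemma 4 E2B on the Chef domain, means assertiveness carries no information about correctness.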
major comments (1)
- The central claim that no operator moves disposition without side effects rests on the judge reliably scoring the intended traits independently of content quality and stylistic mimicry. While the sidecar probe's AUC drop (0.683 to 0.516) is reported, the manuscript does not provide equivalent cross-prompt or cross-domain validation for the primary disposition judge; without it the consistent 'no movement' finding risks being an artifact of correlated scoring rather than evidence that no effective operator exists.
minor comments (2)
- The abstract and main text use 'Gemma 4 E2B' without a precise model identifier or citation; add the exact checkpoint name and reference for reproducibility.
- The two-failure-mode taxonomy for h_last probes is a useful contribution; consider a short table or diagram that maps the modes to the observed AUC collapse and mimicry cases.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The concern about validating the primary disposition judge is well-taken and directly relevant to the strength of our negative result. We address it below and will revise the manuscript accordingly.
Point-by-point responses
Referee: The central claim that no operator moves disposition without side effects rests on the judge reliably scoring the intended traits independently of content quality and stylistic mimicry. While the sidecar probe's AUC drop (0.683 to 0.516) is reported, the manuscript does not provide equivalent cross-prompt or cross-domain validation for the primary disposition judge; without it the consistent 'no movement' finding risks being an artifact of correlated scoring rather than evidence that no effective operator exists.
Authors: We agree that the absence of reported cross-validation for the primary disposition judge is a limitation in the current manuscript. The judge was prompted to isolate the three target traits (self-verification, uncertainty acknowledgment, feedback integration) with explicit instructions to ignore content quality, factual accuracy, and stylistic mimicry; our three-arc design further tested this by showing that operators either damaged content or produced only stylistic changes without trait movement. However, we did not perform the same systematic cross-prompt and cross-domain checks we applied to the h_last sidecar. In the revised manuscript we will add an equivalent validation section for the judge, including (1) cross-prompt consistency on held-out prompt sets and (2) cross-domain transfer between the two domains tested. These results will be reported alongside the existing sidecar analysis and will be used to qualify the strength of the 'no movement' claim. We view this as a necessary strengthening rather than a refutation of the negative result.
Revision: yes
Circularity Check
No circularity; self-contained empirical negative result
Full rationale
The paper reports direct experimental outcomes across SFT/DPO LoRA, inference-time attention tempering, and a frozen-base h_last sidecar on five models. Central claims rest on observed performance deltas, judge scores, and AUC measurements (e.g., within-distribution 0.683 collapsing to 0.516 out-of-distribution) rather than any derived quantities. Explicit self-falsification of the authors' own prior positive results (MCAS/HumanEval artifacts) is presented transparently without re-deriving those positives. No equations, fitted parameters, uniqueness theorems, or self-citations appear as load-bearing steps. The work is a straightforward empirical falsification pipeline with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Judge models provide a valid and stable measurement of behavioral dispositions independent of content and style.