Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Adil Amin

arXiv: 2605.18838 · v3 · pith:GJWMKKG2new · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-30 21:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords language model scaling · alignment phase transition · reasoning truthfulness correlation · critical scale · output projection bottleneck · benchmark diagnostics · capability coupling

The pith

Language models switch from anticorrelated to correlated reasoning and truthfulness at a critical scale around 3.5 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scaling laws describe how loss falls with more compute but leave open whether distinct capabilities inside a model compete or reinforce. This work measures the link between reasoning and truthfulness across 63 base models from 16 families and locates a sharp regime change invisible to loss curves. Below a family-dependent threshold the two abilities move in opposite directions; above it they move together. The location of the threshold depends on architecture, data curation, and training details rather than size alone, and width normalization removes the negative coupling in every family tested. The same pattern appears at frontier scale, and a single vector addition at inference time corrects a majority of misaligned outputs without any retraining.

Core claim

The coupling between reasoning and truthfulness undergoes a regime change at a family-dependent critical scale N_c of roughly 3.5 billion parameters. Below N_c the abilities anticorrelate; above N_c they correlate positively. Architecture, data curation, and training recipe each shift N_c independently, while width normalization eliminates the anticorrelation across all tested families. Internally, 38 of 40 models show zero competing attention heads, and a sparse-regression ODE cross-predicts held-out models at low error. The transition is diagnosed from public benchmark scores alone and extends to the frontier.

What carries the argument

The output-projection bottleneck, diagnosed by the disappearance of anticorrelation under width normalization and by the sign flip of capability coupling at N_c.
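
A minimal sketch of the width-normalization check, assuming (as the Figure 4 caption describes) that it means dividing each benchmark score by the model width d_model before correlating. The function names are hypothetical and the scores are synthetic, chosen only to make the mechanics concrete; none of the values come from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def width_normalized_r(reasoning, truthfulness, d_model):
    """Correlate the two score series after dividing each by model width."""
    d = np.asarray(d_model, dtype=float)
    return pearson_r(np.asarray(reasoning) / d, np.asarray(truthfulness) / d)

# Synthetic, illustrative values only (not from the paper).
reasoning = [0.30, 0.35, 0.40, 0.45]     # stand-in for a reasoning benchmark
truthfulness = [0.26, 0.24, 0.23, 0.22]  # stand-in for a truthfulness benchmark
d_model = [512, 768, 1024, 2048]         # assumed hidden widths

raw_r = pearson_r(reasoning, truthfulness)
norm_r = width_normalized_r(reasoning, truthfulness, d_model)
print(f"raw r = {raw_r:+.3f}, width-normalized r = {norm_r:+.3f}")
```

Because both series are divided by the same quantity, part of the sign flip in a toy like this comes from the shared denominator, so the example illustrates the computation rather than the paper's finding.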

If this is right

The cooperative regime already governs current frontier models.
Adding a single truth-direction vector at the identified layer corrects 60 percent of misaligned outputs at inference time with no weight changes.
Data curation alone can raise coupling from near zero to 0.83 at matched scale.
The phase boundary is detectable without any access to model internals.
Width normalization removes the anticorrelation in every family examined.
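
The inference-time correction in the list above amounts to a vector addition on one layer's activations. A minimal sketch, assuming the direction is unit-normalized and added at every token position with a scalar step size `alpha`; those choices, the function name, and the toy arrays are assumptions, and this is not the released steering CLI.

```python
import numpy as np

def add_steering_vector(hidden, direction, alpha=1.0):
    """Return hidden states shifted along a unit-normalized direction.

    hidden:    array of shape (seq_len, d_model), one layer's activations
    direction: array of shape (d_model,), the hypothetical truth direction
    alpha:     assumed scalar step size (not specified in the source)
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = direction / norm
    # Broadcasting adds the same vector at every token position.
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))   # toy activations, not real model states
direction = rng.normal(size=8)     # stand-in for a learned direction
steered = add_steering_vector(hidden, direction, alpha=2.0)
```

The function returns a new array and leaves both the input activations and any model weights untouched, which matches the "no weight changes" framing.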

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bottleneck is genuinely output-projection limited, analogous sign flips may appear between other pairs of capabilities.
The low-dimensional ODE that cross-predicts held-out trajectories suggests capability growth can be treated as a dynamical system.
Steering vectors located this way may generalize to other forms of misalignment.
The family dependence of N_c implies that training decisions can be used to push models into the cooperative regime earlier.

Load-bearing premise

The chosen public benchmarks supply independent, uncontaminated measures of reasoning and truthfulness so that their observed correlation reflects an internal model property rather than shared test artifacts.

What would settle it

New models trained or evaluated below the reported critical scale that display positive rather than negative correlation between reasoning and truthfulness on the same benchmarks would falsify the claimed phase transition.

Figures

Figures reproduced from arXiv: 2605.18838 by Adil Amin.

**Figure 1.** Capability coupling phase transition across 63 models and 16 families. (a) Phase diagram: HellaSwag vs. TruthfulQA across families, showing the U-shaped trajectory. (b) Running coupling γ12(N) for six families, with architecture-specific N_c marked. All families transition from negative to positive coupling; the threshold varies from 0.12B (OPT) to 7B (Falcon). (c) OLMo confirmation: γ12 = 0.000 at 1B param…

**Figure 2.** Loss is exact; the transition lives in the coupling. (a) N^α (L − E) = 154 ± 2 (CV = 0.8%) across all 8 Pythia models: loss follows a single power law with no visible transition. (b) Boosting chain: the independent-parameter gradient prediction (L1) makes the error 142× worse, the strongest single diagnostic that parameters are collectively coupled. The collective correction (L2) restores agreement. (c) Holdo…

**Figure 3.** ODE reproduces benchmark trajectories and cross-predicts held-out family. Sparse regression discovers a dynamical system that simultaneously fits five Pythia benchmarks (HellaSwag, TruthfulQA, ARC, WinoGrande, MMLU) at 2.6% mean error. Cross-prediction on held-out Llama-2 achieves 5.6% MAE, approximately twice the accuracy of polynomial baselines. … before the N_{c,2} crash; the ODE is predictive within a phase b…

**Figure 4.** The alignment tax is a design choice. (a) Qwen2.5 at 1.5B shows a coupling dip (3% cooperative, net = 0.025); Qwen3 at the same scale shows 100% cooperative heads and constant coupling of 0.830. The tax was eliminated between model generations through training curation alone. (b) Width normalization: dividing benchmark scores by model width (d_model) flips the correlation from negative to positive for all …

**Figure 5.** Internal coupling: zero competing heads across 40 models. Bars show the percentage of cooperative attention heads per family (averaged across sizes). 38 of 40 individual models show 100% cooperative heads. The two exceptions are both Qwen2.5: at 1.5B, only 3% of heads are cooperative (the remaining 97% compete: the known dip), and at 7B, 99.7% cooperative (mild last-layer dip). These pull the Qwen2.5 family…

**Figure 6.** Output projection bottleneck is scale-specific. At Pythia-410M (tax) and Pythia-2.8B (bonus), the projection increases coupling. At Pythia-1B (N_c), coupling drops from 0.725 (hidden) to 0.639 (output), a 12% compression loss. A wider projection recovers coupling to 0.805. The bottleneck is dimensional: it appears only at the transition scale. This co…

**Figure 7.** The critical scale is a training parameter, not a size barrier. (a) In coupling-dimensionality space, PLE architecture trades per-dimension coupling for representational axes (Gemma-3→Gemma-4, dashed arrow), and RLHF restores coupling while preserving the extra dimensions (solid red arrow). All three models are 4B parameters. (b) Data curation eliminates the alignment tax: Qwen2.5 at 1.5B has coupling 0.…

**Figure 8.** Per-layer coupling depth profiles across the phase transition. Below N_c (70M, 160M): coupling rises with depth (positive slope). At N_c (410M–1B): coupling reverses, peaking in early layers and falling toward the output, with Pythia-1B showing the strongest decline (slope = −0.005/layer, r = −0.68). Above N_c (2.8B–12B): mild decline persists but the reversal amplitude relaxes. The final-layer drop to 0.81–0.84 …

Original abstract

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.
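
The abstract reports its anticorrelation with a nonparametric permutation test. A minimal sketch of such a test for a Pearson correlation follows, assuming a two-sided test that shuffles one series; the function name, the add-one correction, and the synthetic scores are assumptions, and the sketch is not expected to reproduce the paper's r or p values.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any pairing with x, simulating the null.
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return float(observed), (hits + 1) / (n_perm + 1)

# Synthetic, illustrative values only (not from the paper).
reasoning = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
truthfulness = [0.40, 0.36, 0.33, 0.31, 0.28, 0.26, 0.25, 0.22]
r_obs, p_val = permutation_pvalue(reasoning, truthfulness, n_perm=2000)
print(f"r = {r_obs:+.3f}, permutation p = {p_val:.4f}")
```

With `n_perm` shuffles the smallest reportable p-value is 1 / (n_perm + 1), so a reported value near 4 x 10^{-5} would need tens of thousands of permutations or an exact enumeration.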

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The phase transition claim rests on benchmark independence, but the examples of training choices shifting the correlation point are concrete and worth checking.

The paper reports that reasoning and truthfulness scores switch from strong anticorrelation to positive correlation at a family-dependent critical scale around 3.5B parameters, with architecture, data curation, and training recipe each able to move that point. They support this with Pearson correlations and a nonparametric permutation test across 63 models from 16 families, plus bootstrap intervals on the critical scale.

They handle a couple of things cleanly. The examples are specific: curated data removes the dip between Qwen versions, Gemma-4 at 4B reaches coupling typical of much larger standard models, and Phi at 1B matches web-trained performance at 10B. Releasing the full code, data, dashboard, and an inference-time steering CLI lets others test the pattern directly on new models. The attention-head check and the 60% correction from a single vector addition are at least falsifiable claims.

The main weakness is that the anticorrelation is computed on public benchmark scores whose independence is not demonstrated. If those reasoning and truthfulness benchmarks share contamination, training-data overlap, or selection effects, the negative correlation below the threshold could be an artifact rather than evidence of an internal bottleneck. Both the N_c fit and the sparse-regression ODE are derived from the same 63-model set, so the held-out prediction is not strongly independent. The abstract gives no detail on how the 63 models or the specific benchmarks were chosen.

This is for people tracking capability interactions during scaling who want something beyond loss curves. It has enough structure and released artifacts to deserve referee time, though the benchmark-independence premise will need direct scrutiny in review.

Referee Report

3 major / 2 minor

Summary. The paper claims a previously undetected phase transition in capability coupling: across 63 base models from 16 families, reasoning and truthfulness benchmarks anticorrelate (r = -0.989) below a family-dependent critical scale N_c ≈ 3.5B parameters and cooperate above it. N_c is shifted by architecture, data curation, and training recipe; width normalization removes the anticorrelation; 38/40 models show no competing attention heads; a sparse-regression ODE cross-predicts held-out behavior at 5.6% error; and a single truth-direction steering vector corrects 60% of misaligned outputs at inference time. The diagnostic uses only public benchmark scores.

Significance. If the central claim holds, the work identifies a scaling regime in capability interactions invisible to loss curves and supplies a falsifiable, benchmark-only diagnostic plus an exploitable bottleneck intervention. The release of code, data, steering CLI, and dashboard strengthens reproducibility. The nonparametric permutation test and bootstrap CI on N_c are positive methodological features.

major comments (3)

[Abstract and benchmark-correlation analysis] The premise that the chosen public benchmarks supply independent, uncontaminated signals of reasoning and truthfulness is load-bearing for the reported anticorrelation and phase transition, yet the nonparametric permutation test and bootstrap CI on N_c do not test benchmark construction artifacts, training-data overlap, or selection effects across the 63-model sample. This assumption enters directly into the Pearson r and p-values.
[ODE cross-prediction and N_c fitting procedure] N_c is defined and fitted from the same correlation data used to claim the phase transition; the bootstrap CI and the sparse-regression ODE are both derived from the identical model set, so the 'cross-prediction' of held-out Llama-2 behavior reduces to quantities fitted on the full dataset rather than a true out-of-sample test.
[Family-dependent N_c shifts] The claim that 'architecture, data curation, and training recipe each shift N_c independently' rests on comparisons (e.g., Qwen generations, Gemma-4, Phi) whose benchmark scores may share family-specific artifacts; no control is reported that isolates these factors from benchmark-suite effects.

minor comments (2)

[Abstract] The abstract states the ODE 'cross-predicts held-out Llama-2' without specifying the exact train/test split or whether the 5.6% error is on the same correlation features used to fit N_c.
[Internal analysis] The attention-head analysis ('38 of 40 models show zero competing attention heads') lacks a precise definition of 'competing' and the layer(s) examined; this detail is needed to assess whether it independently supports the output-projection bottleneck.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting points that strengthen the presentation of our results. We address each major comment below. Where the comments identify opportunities for clarification or additional robustness checks, we will revise the manuscript accordingly. The core empirical findings—the scale-dependent sign change in reasoning-truthfulness coupling and the family-specific location of N_c—remain unchanged, as they are directly supported by the public benchmark data across the 63 models.

Referee: [Abstract and benchmark-correlation analysis] The premise that the chosen public benchmarks supply independent, uncontaminated signals of reasoning and truthfulness is load-bearing for the reported anticorrelation and phase transition, yet the nonparametric permutation test and bootstrap CI on N_c do not test benchmark construction artifacts, training-data overlap, or selection effects across the 63-model sample. This assumption enters directly into the Pearson r and p-values.

Authors: We agree that the permutation test and bootstrap CI evaluate the statistical properties of the observed correlation under a null of no relationship but do not directly probe benchmark contamination or training-data overlap. The benchmarks were selected because they are the standard public evaluations used in the literature to proxy the two capabilities in question; however, we will add a dedicated subsection in the Methods and a paragraph in the Discussion that (i) enumerates the specific benchmarks, (ii) cites prior work on their construction and known limitations, and (iii) reports supplementary correlations using two additional reasoning and two additional truthfulness benchmarks that were not part of the original 63-model analysis. These checks will be presented as robustness evidence rather than as a formal test of contamination. Revision: partial.

Referee: [ODE cross-prediction and N_c fitting procedure] N_c is defined and fitted from the same correlation data used to claim the phase transition; the bootstrap CI and the sparse-regression ODE are both derived from the identical model set, so the 'cross-prediction' of held-out Llama-2 behavior reduces to quantities fitted on the full dataset rather than a true out-of-sample test.

Authors: The referee is correct that the primary N_c estimate and its bootstrap CI are obtained from the full 63-model correlation structure. The reported 5.6% error on Llama-2 was obtained by refitting the sparse-regression ODE after removing all Llama-2 variants from the training set and then predicting the held-out family; however, the overall model-selection step that determined the ODE structure itself used the complete dataset. We will revise the text to make this distinction explicit, replace the current description with a clearer leave-one-family-out protocol, and report the corresponding error under that stricter protocol. If the error remains comparable, we will state so; otherwise we will qualify the claim. Revision: yes.

Referee: [Family-dependent N_c shifts] The claim that 'architecture, data curation, and training recipe each shift N_c independently' rests on comparisons (e.g., Qwen generations, Gemma-4, Phi) whose benchmark scores may share family-specific artifacts; no control is reported that isolates these factors from benchmark-suite effects.

Authors: We acknowledge that the within-family comparisons (Qwen generations, Gemma-4 vs. prior Gemma, Phi vs. web-scale models) could in principle be influenced by family-specific benchmark artifacts. Because the paper relies exclusively on public benchmark scores, a controlled experiment that holds the benchmark suite fixed while varying only architecture or data is not feasible with the current dataset. We will therefore (i) add an explicit limitations paragraph stating this constraint, (ii) report the raw per-family correlation matrices so readers can inspect consistency, and (iii) note that the width-normalization result, which removes the anticorrelation uniformly across families, provides an internal control that is independent of the absolute benchmark values. These additions constitute a partial revision; the directional claims about N_c shifts will be softened to “consistent with” rather than “demonstrate independent” effects. Revision: partial.
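
The leave-one-family-out protocol promised in the second response can be sketched as follows. This is only an illustration under stated assumptions: a straight-line fit in log N stands in for the paper's sparse-regression ODE, which is not reproduced here, and the function name, family labels, and scores are hypothetical and synthetic.

```python
import numpy as np

def leave_one_family_out_mae(log_n, score, family):
    """Mean absolute error per held-out family for a linear fit in log N.

    The linear fit is a deliberately simple stand-in for whatever model
    is being evaluated; it is not the paper's method.
    """
    log_n = np.asarray(log_n, dtype=float)
    score = np.asarray(score, dtype=float)
    family = np.asarray(family)
    errors = {}
    for held_out in np.unique(family):
        test = family == held_out
        train = ~test
        # Fit slope and intercept on every family except the held-out one.
        # Any structure selection would also belong inside this loop,
        # using training rows only.
        design = np.column_stack([log_n[train], np.ones(train.sum())])
        coef, *_ = np.linalg.lstsq(design, score[train], rcond=None)
        pred = coef[0] * log_n[test] + coef[1]
        errors[str(held_out)] = float(np.mean(np.abs(pred - score[test])))
    return errors

# Synthetic data on an exact line, so every held-out error is ~0.
log_n = np.array([8.0, 9.0, 10.0, 8.5, 9.5, 10.5, 8.2, 9.2, 10.2])
score = 0.1 * log_n + 0.2
family = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
errors = leave_one_family_out_mae(log_n, score, family)
print(errors)
```

The point of the loop is that nothing fitted or selected ever sees the held-out family's rows, which is the distinction the referee's comment turns on.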

Circularity Check

0 steps flagged

No significant circularity detected

The paper empirically observes a sign change in Pearson correlation between public benchmark scores for reasoning and truthfulness across 63 models from 16 families, estimates a family-dependent critical scale N_c via bootstrap on that data, and fits a sparse-regression ODE that it reports as cross-predicting a held-out model. These steps constitute standard statistical estimation and cross-validation rather than any self-definitional loop, fitted input renamed as prediction, or load-bearing self-citation. No equations or claims in the provided text reduce the central observation to its own inputs by construction; the phase-transition claim is an empirical pattern in the measured correlations, not a quantity defined in terms of itself. Benchmark independence is an assumption but does not create circularity in the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of benchmark scores as proxies for reasoning and truthfulness, the assumption that correlation changes reflect an internal bottleneck rather than measurement artifacts, and a fitted critical scale N_c whose value is determined from the same data.

free parameters (1)

N_c
Critical scale separating the two regimes, estimated per family with bootstrap CI from the observed correlation switch.
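
A minimal sketch of how a bootstrap interval on a critical scale could be computed. The estimator here is an assumption made for illustration: N_c is taken as the zero-crossing of a straight-line fit of coupling against log10(N), which may differ from the paper's definition. Function names and the synthetic couplings are hypothetical.

```python
import numpy as np

def zero_crossing(log_n, coupling):
    """Scale where a straight-line fit of coupling vs. log10(N) crosses zero."""
    slope, intercept = np.polyfit(log_n, coupling, 1)
    return float(-intercept / slope)

def bootstrap_ci(log_n, coupling, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the zero-crossing, in log10(N)."""
    log_n = np.asarray(log_n, dtype=float)
    coupling = np.asarray(coupling, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(log_n)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample models with replacement
        if np.unique(log_n[idx]).size < 2:
            continue                       # a line needs two distinct scales
        estimates.append(zero_crossing(log_n[idx], coupling[idx]))
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(estimates, [tail, 1.0 - tail])
    return float(lo), float(hi)

# Synthetic, illustrative values only (not from the paper).
log_n = [8.0, 8.5, 9.0, 9.5, 10.0, 10.5]        # log10 of parameter count
coupling = [-0.6, -0.35, -0.2, 0.05, 0.3, 0.5]  # made-up coupling values
point = zero_crossing(log_n, coupling)
lo, hi = bootstrap_ci(log_n, coupling)
print(f"log10(N_c) = {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Raising 10 to each endpoint converts the interval back to a parameter count. Because the interval is built by resampling the same models that define the estimate, it measures sampling variability only, not benchmark validity.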

axioms (1)

Domain assumption: Public benchmark scores provide independent measures of reasoning capability and truthfulness whose correlation is not driven by shared contamination or selection bias.
Invoked when computing r = -0.989 and the phase boundary across the 63 models.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
cs.LG 2026-05 unverdicted novelty 5.0

Decomposition of SWE-bench and GPQA scores from 34 models (2024-2026) reveals positive capability coupling (r=+0.72) with lab-specific variations, benchmark saturation, and a three-level playbook for next measurements.
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
cs.LG 2026-05 unverdicted novelty 4.0

Frontier models show positive capability coupling (r=0.72) across SWE-bench and GPQA, with lab-specific emphasis shifts measured by an h-field residual that distinguishes permanent pretraining changes from reversible ...

Reference graph

Works this paper leans on

5 extracted references · 2 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Adil Amin. The growing pains of frontier models: When leaderboards stop separating and what to measure next. arXiv preprint arXiv:2605.18840, 2026.

[2] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[3] Neel Nanda. TransformerLens. 2022. https://github.com/TransformerLensOrg/TransformerLens.
Yangjun Ruan, Chris J Maddison, and Tatsunori B Hashimoto. Observational scaling laws and the predictability of language model performance. Advances in Neural Information Processing Systems, 37, 2024.

[4]

defines the surface in benchmark space where the coupling changes sign. The same condition generalizes to each successive transition: at N_{c,2}, GPQA_c = √(a_2/b_2)·SWE; at N_{c,3}, IFEval_c = √(a_3/b_3)·GPQA, with a/b recalibrated from the boundary model at each scale Amin [2026]. Within standard web-trained families, the isocline correctly predicts the coupling sig...

2026
[5]

applies the N_{c,3} isocline to four frontier models with IFEval scores and finds mixed-phase behavior: two models below the boundary, one at it. A.8 Additional evidence: training dynamics. Direct gradient measurements on 6 Pythia models (70M–2.8B) provide an independent confirmation channel that does not rely on benchmark scores. The gradient norm follows ∥∇L...

2026
