Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

Archit Manek

arxiv: 2606.29196 · v1 · pith:DNRKHSXGnew · submitted 2026-06-28 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-30 09:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords evaluation awareness · representational depth · language models · model scale · linear probes · scaling laws · AI safety

The pith

Evaluation awareness in language models shifts to earlier layers as models scale up.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether language models detect when they are being tested, a capacity that could let them change behavior in ways that complicate safety benchmarks. Across the Qwen 2.5 and Gemma 2 families it reports that the layer yielding the strongest linear signal for this awareness moves from late positions in small models to early positions in large models. According to the paper, this depth shift helps explain why within-family scaling trajectories are non-monotonic or inverse rather than smoothly improving. White-box linear probes on hidden states yield consistently stronger signals than black-box behavioral measures, and the link between the two varies across families.

Core claim

In both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones, showing that scale alters not only the strength but also the representational depth of this capacity.

What carries the argument

Linear probes trained on hidden-state activations that recover evaluation-awareness signals at each layer.
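
As a rough illustration of what layer-wise probing involves, the sketch below fits one logistic-regression probe per layer and reports held-out AUROC for each. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function names, the hyperparameters, the synthetic "activations", and the definition of relative depth as the 1-indexed peak layer divided by the number of layers are all hypothetical.

```python
import math
import random

def train_linear_probe(X, y, l2=1e-2, lr=0.5, epochs=200):
    """Fit an L2-regularised logistic-regression probe by full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def layerwise_auroc(train_acts, y_train, test_acts, y_test):
    """train_acts / test_acts hold one feature matrix (list of rows) per layer."""
    results = []
    for X_train, X_test in zip(train_acts, test_acts):
        w, b = train_linear_probe(X_train, y_train)
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X_test]
        results.append(auroc(scores, y_test))
    return results

def peak_relative_depth(aurocs):
    """Assumed definition: 1-indexed peak layer divided by the number of layers."""
    peak = max(range(len(aurocs)), key=lambda i: aurocs[i])
    return (peak + 1) / len(aurocs)

def synthetic_layer(rng, labels, signal, dim=4):
    """Hypothetical stand-in for hidden states: Gaussian noise plus a label-dependent shift on feature 0."""
    rows = []
    for label in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += signal if label == 1 else -signal
        rows.append(row)
    return rows

rng = random.Random(0)
y_train = [i % 2 for i in range(100)]
y_test = [i % 2 for i in range(100)]
signals = [0.0, 0.5, 1.5]  # made-up signal strength per synthetic layer
train_acts = [synthetic_layer(rng, y_train, s) for s in signals]
test_acts = [synthetic_layer(rng, y_test, s) for s in signals]
aurocs = layerwise_auroc(train_acts, y_train, test_acts, y_test)
print(aurocs, peak_relative_depth(aurocs))
```

In a real run, the synthetic matrices would be replaced by hidden states extracted at each layer for evaluation and non-evaluation prompts, and the probe settings would be held fixed across model sizes so that differences in the peak layer are not driven by the probe itself.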

If this is right

Within-family scaling trajectories for evaluation-related tasks become non-monotonic rather than following a simple power law.
White-box probe signals remain stronger than observable behavioral expression of evaluation awareness.
The mapping from probe AUROC to behavioral change differs across model families.
Safety evaluations that rely on late-layer representations may miss awareness signals in larger models.
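
One way to make the first implication checkable is to test a family's size-versus-metric series for monotonicity and to see how well a straight line in log-log space fits it. The sketch below does both. The function names and the example numbers are hypothetical placeholders, not values from the paper.

```python
import math

def is_monotonic(values):
    """True if the sequence never changes direction."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

def fit_power_law(sizes, values):
    """Least-squares fit of log(value) = log(a) + k * log(size).

    Inputs must be strictly positive. Returns (a, k, r_squared).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    log_a = mean_y - k * mean_x
    ss_res = sum((y - (log_a + k * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return math.exp(log_a), k, r_squared

# Hypothetical placeholder series: parameter counts (billions) and an
# above-chance metric such as probe AUROC minus 0.5. Not taken from the paper.
sizes = [0.5, 1.5, 3.0, 7.0, 14.0]
above_chance = [0.20, 0.31, 0.27, 0.35, 0.24]

a, k, r_squared = fit_power_law(sizes, above_chance)
print(is_monotonic(above_chance), round(k, 3), round(r_squared, 3))
```

A series that fails the monotonicity check and has a low log-log fit quality would be consistent with the non-monotonic trajectories described above, though with only a handful of checkpoints per family such a check is suggestive rather than conclusive.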

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Earlier-layer detection in large models could make real-time monitoring of evaluation context cheaper if probes can be restricted to those depths.
The depth shift may connect to changes in how context is integrated early in the forward pass, affecting other context-sensitive behaviors.
Testing whether the same early-layer concentration appears for other safety-relevant internal states would extend the result beyond evaluation awareness.

Load-bearing premise

Linear probes on hidden states give a faithful read-out of where evaluation awareness is represented, independent of architecture details or probe training choices.

What would settle it

Repeating the layer-wise probe analysis on the same models but with non-linear classifiers or on additional model families and finding no consistent early-layer shift in larger models.

Figures

Figures reproduced from arXiv: 2606.29196 by Archit Manek.

**Figure 1.** Relative layer depth at which probe AUROC peaks for each model. In both Qwen 2.5 and Gemma 2, this depth shifts strongly with model size, moving from late layers in smaller models to early layers in larger ones. In Qwen 2.5, the 1.5B and 3B models peak at relative depths of 0.96 and 0.97, placing the strongest signal in the final layers. In Gemma 2, the 2B and 9B models also peak late, at 0.885 …

**Figure 2.** Peak probe AUROC vs. model size on a log scale. Solid lines connect checkpoints within the same family. The dotted horizontal line marks chance performance at 0.5. Llama 3.2 is shown as a contrast family rather than a family used for within-family scaling claims. … layers, suggesting that the late-layer pattern itself emerges within the small-model regime rather than being present from the smallest size onwa…

**Figure 3.** Probe AUROC, black-box classification accuracy, and relative peak layer depth across model sizes for each family. Solid lines show probe AUROC, dashed lines show black-box accuracy, and the orange series indicates relative peak layer depth on the right axis. Shaded regions indicate the gap between internal representation strength and behavioural expression. The dotted horizontal line marks chance performan…

Original abstract

Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, we find a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. This suggests that scale changes not only the strength of evaluation-awareness but also where it is most linearly recoverable in the network. This depth shift helps explain why within-family scaling trajectories are non-monotonic or inverse rather than smooth and family-general, showing that a simple universal power-law account is not supported under denser within-family sampling. Finally, white-box probe signals are consistently stronger than black-box behavioural expression, and the relationship between the two varies by family in ways not predicted by probe AUROC alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The scale-dependent early-layer shift for evaluation awareness in Qwen and Gemma is a new empirical pattern, but the linear-probe evidence has too many unaddressed confounds to be convincing yet.

The paper's core observation is that in Qwen 2.5 and Gemma 2 the layer with highest linear probe AUROC for evaluation-awareness tokens moves earlier as parameter count grows. That specific within-family depth shift is not something I've seen reported before, and the denser sampling across sizes lets them show that scaling trajectories for this feature are non-monotonic rather than following a simple power law.

They also get credit for checking that probe signals are stronger than behavioral ones and for noting family-specific differences in how the two relate. That part feels like honest engagement with the gap between internal representations and observable behavior.

The main weakness is exactly the one the stress-test flags. Linear probes on hidden states can easily pick up architecture artifacts when layer-norm placement, residual scaling, and attention variants change with model size inside the same family. The abstract gives no information on whether probe regularization, negative sampling, or training budgets were held constant, nor any error bars or statistical tests. Without those controls it is hard to treat the argmax layer as a clean measure of representational depth. The Llama 3.2 results are mentioned but not described, which leaves the claim resting on two families where scale and architecture co-vary.

This is aimed at the mechanistic interpretability and AI-safety crowd who already track evaluation awareness. A reader who wants to know whether current benchmarks are being gamed at different depths might find the pattern worth testing further.

I would send it to referees once the methods section shows the probe details and some basic robustness checks; the idea is worth a serious look even if the current evidence is preliminary.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically examines evaluation awareness in 11 open-weight models from Qwen 2.5, Gemma 2, and Llama 3.2 families. It reports a scale-dependent shift in representational depth: within Qwen 2.5 and Gemma 2, the layer maximizing linear-probe AUROC for evaluation-awareness tokens moves from late layers in smaller models to early layers in larger models. This is used to explain non-monotonic within-family scaling trajectories (contrary to simple power-law accounts) and to note that white-box probe signals exceed black-box behavioral expression, with family-specific relationships between the two.

Significance. If the depth-shift result holds after addressing probe confounds, the work would usefully complicate universal scaling narratives by showing that denser within-family sampling reveals architecture- and scale-dependent changes in where safety-relevant features become linearly recoverable. The contrast between probe and behavioral measures also supplies a concrete, falsifiable observation about the gap between internal representation and output behavior.

major comments (2)

[results and methods (probe construction)] The central claim (abstract and results) that the argmax layer of linear-probe AUROC indexes representational depth of evaluation awareness is load-bearing, yet the manuscript does not demonstrate that probe training (regularization, negative sampling, early stopping) or architectural features (layer-norm placement, residual scaling) are controlled across model sizes. Because these factors co-vary with parameter count within Qwen 2.5 and Gemma 2, the reported early-layer shift could be an artifact of decodability rather than a change in where the information is represented.
[results (family comparison)] The manuscript reports the depth shift only for Qwen 2.5 and Gemma 2; the Llama 3.2 family is included but does not exhibit the same pattern. Without an explicit analysis of why the shift is family-specific (e.g., differences in pre-training data or architecture), the claim that scale changes representational depth in a general way remains under-supported.

minor comments (2)

[abstract] The abstract states the shift occurs 'in both Qwen 2.5 and Gemma 2' but provides no numerical layer indices, AUROC values, or error bars; these should be added to the abstract or a summary table for clarity.
[methods] Notation for 'evaluation-awareness tokens' and the exact construction of positive/negative examples for the probes should be defined earlier and more explicitly.
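
The error bars requested above could be approximated with a percentile bootstrap over held-out examples, which also shows how stable the argmax layer is under resampling. The following is a minimal sketch, assuming per-layer probe scores on a shared held-out set are already available. The function names, the resample count, and the synthetic scores are hypothetical and do not come from the manuscript.

```python
import random
from collections import Counter

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_peak_layer(layer_scores, labels, n_resamples=200, seed=0):
    """Resample examples with replacement.

    Returns (Counter of argmax layers, list of per-layer 95% percentile intervals).
    """
    rng = random.Random(seed)
    n = len(labels)
    peak_counts = Counter()
    samples = [[] for _ in layer_scores]
    while len(samples[0]) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUROC needs both classes in the resample
            continue
        aucs = [auroc([scores[i] for i in idx], y) for scores in layer_scores]
        peak_counts[max(range(len(aucs)), key=lambda j: aucs[j])] += 1
        for j, value in enumerate(aucs):
            samples[j].append(value)
    intervals = []
    for s in samples:
        s.sort()
        intervals.append((s[int(0.025 * len(s))], s[int(0.975 * len(s)) - 1]))
    return peak_counts, intervals

# Synthetic probe scores for three layers with made-up class separations.
rng = random.Random(1)
labels = [i % 2 for i in range(60)]
layer_scores = [
    [rng.gauss(sep if l == 1 else -sep, 1.0) for l in labels]
    for sep in (0.0, 0.3, 2.0)
]
peak_counts, intervals = bootstrap_peak_layer(layer_scores, labels)
print(peak_counts.most_common(), intervals)
```

If the argmax layer moves between several depths across resamples, or the intervals of neighbouring layers overlap heavily, the reported peak depth should be read as a range rather than a single layer index.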

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments below.

Referee: [results and methods (probe construction)] The central claim (abstract and results) that the argmax layer of linear-probe AUROC indexes representational depth of evaluation awareness is load-bearing, yet the manuscript does not demonstrate that probe training (regularization, negative sampling, early stopping) or architectural features (layer-norm placement, residual scaling) are controlled across model sizes. Because these factors co-vary with parameter count within Qwen 2.5 and Gemma 2, the reported early-layer shift could be an artifact of decodability rather than a change in where the information is represented.

Authors: We agree that explicit controls are important. We applied the same linear probe training protocol to every model, including fixed L2 regularization, the same negative sampling strategy from non-evaluation tokens, and identical early stopping criteria based on validation loss. Architectural differences such as layer-norm placement are intrinsic to each model and co-vary with size, but our finding concerns the change in the argmax layer under this fixed probing method. We will revise the methods section to include a table of probe hyperparameters and add a limitations paragraph acknowledging that architectural co-variation could influence absolute decodability. Revision: partial.
Referee: [results (family comparison)] The manuscript reports the depth shift only for Qwen 2.5 and Gemma 2; the Llama 3.2 family is included but does not exhibit the same pattern. Without an explicit analysis of why the shift is family-specific (e.g., differences in pre-training data or architecture), the claim that scale changes representational depth in a general way remains under-supported.

Authors: The manuscript does not claim that the depth shift is general across all families. We report the shift in Qwen 2.5 and Gemma 2 and note its absence in Llama 3.2 as part of the results. This supports our broader point that within-family scaling trajectories are not uniformly power-law. A detailed causal analysis of family differences would require pre-training data access and controlled ablations that are beyond the scope of the current work. We will revise the discussion to more explicitly state that the phenomenon is family-dependent and to qualify the implications accordingly. Revision: partial.

Circularity Check

0 steps flagged

No circularity: purely empirical observation across models

The paper reports an empirical finding from linear probes on hidden states of multiple model families (Qwen 2.5, Gemma 2, Llama 3.2) showing a scale-dependent shift in the argmax layer for evaluation-awareness AUROC. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claim is an observation of layer-index shifts within families and does not reduce to any input by construction; it remains falsifiable by re-running the probes on the same models. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is abstract-only so ledger is inferred from typical assumptions in probing papers: linear probes recover semantic features; evaluation contexts are well-defined; model families are comparable.

axioms (1)

domain assumption: Linear probes on activations capture the depth of concept representation.
Central to the claim that the 'most linearly recoverable' layer indicates representational depth.

pith-pipeline@v0.9.1-grok · 5707 in / 1145 out tokens · 42462 ms · 2026-06-30T09:02:00.902404+00:00 · methodology
