From Simulation to Enaction: Post-trained language models recognize and react to their own generations

Asvin G.; Jack Lindsey

arxiv: 2605.25459 · v1 · pith:KYOPUF3Inew · submitted 2026-05-25 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-29 22:54 UTC · model grok-4.3

keywords: language models, post-training, on-policy, entropy, self-recognition, input surprise, output distribution, topic collapse

The pith

Post-trained language models detect when they generate their own text and reduce output entropy by a factor of three to four.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that post-training equips language models with the ability to recognize when they are producing their own outputs rather than responding to external input. This recognition is implicitly encoded in their probability distributions, causing on-policy generations to exhibit substantially lower entropy than off-policy ones. The effect appears consistently across model families and scales, and is partly driven by an internal signal that tracks how surprising the most recent input token was according to the model's own prior predictions. Models also pre-commit to a topic for their response before emitting the first token, an effect absent in pretrained models, and they can report the distinction explicitly though through a separate route.

Core claim

Post-trained models recognize their on-policy generations, with this recognition implicitly encoded in output distributions such that on-policy entropy is 3-4 times lower than off-policy entropy across model families and sizes. Part of the effect traces to an internal representation of input surprise that tracks the unlikeliness of the most recent input token per the model's prior predictions and causally modulates output entropy. In response to open-ended prompts, post-trained models collapse uncertainty over response topic before the first output token; violating this cached intention raises entropy. Explicit verbal reports of on-policy status are possible but route through a different mec

What carries the argument

Internal representation of input surprise that tracks the unlikeliness of the most recent input token according to the model's prior predictions and causally modulates output entropy.
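The two quantities this argument turns on can be written down concretely. The sketch below is illustrative and is not the authors' code: it assumes a `(T, V)` array of next-token logits where row `t` is the prediction for position `t + 1`, and the function names are hypothetical. Entropy is reported in nats, matching the units used in the figure captions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def output_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution at each position."""
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

def input_surprise(logits, token_ids):
    """Surprise (nats) of each incoming token under the prediction made one
    position earlier: s_t = -log p(x_t | x_<t).

    Assumption: logits[t] predicts position t + 1, so the first token has no
    surprise and the result has length T - 1.
    """
    logp = log_softmax(logits)
    token_ids = np.asarray(token_ids)
    return -logp[np.arange(len(token_ids) - 1), token_ids[1:]]
```

With uniform logits over four tokens, both functions return log 4 (about 1.386 nats), which is a quick sanity check on the indexing.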

If this is right

Post-trained models pre-commit to a response topic before generating any tokens on open-ended prompts.
Intervening on the internal surprise representation alters subsequent output entropy.
Explicit recognition of on-policy status can be elicited in verbal reports but follows a distinct causal pathway from the entropy effect.
The entropy reduction and topic collapse hold across multiple model families and size classes.
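One simple way to quantify topic pre-commitment is the entropy of the topic labels across many sampled completions of the same prompt. This is a minimal sketch under stated assumptions: the topic labels are taken as given (how the paper assigns them is not described on this page), and `topic_entropy` is a hypothetical helper, not a function from the paper.

```python
from collections import Counter
import math

def topic_entropy(topic_labels):
    """Empirical entropy (nats) of topic labels over sampled completions.

    A value near 0 means almost every sample lands on the same topic;
    log(k) is the maximum when k topics are equally frequent.
    """
    counts = Counter(topic_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under this measure, a model that settles on one topic before its first output token would score close to zero on an underspecified prompt, while a model that spreads over many topics would score higher.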

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The emergence of this self-monitoring after post-training may support more stable continuation of long generations without external prompting.
The separation between implicit and explicit recognition pathways suggests that future work could target one without affecting the other.
If the surprise signal generalizes beyond language, similar mechanisms might appear in other sequential generative models after alignment training.

Load-bearing premise

The entropy reduction occurs because the model recognizes on-policy status rather than because of uncontrolled statistical differences in the token sequences presented under the two conditions.

What would settle it

An experiment that equalizes token-distribution statistics between on-policy and off-policy contexts while preserving the model's ability to distinguish them, then checks whether the 3-4x entropy gap disappears.
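A lightweight approximation of that experiment is to compare entropies only within strata of matched input surprise. The sketch below is an assumption about how such a check could be set up, not a procedure from the paper; it matches on surprise alone, so other token statistics would need their own strata. All names are hypothetical.

```python
import numpy as np

def matched_entropy_gap(s_on, h_on, s_off, h_off, n_bins=10):
    """Off-policy / on-policy mean-entropy ratio within shared surprise bins.

    s_* are per-token input surprises, h_* the output entropies at the same
    positions. Bins are quantiles of the pooled surprise values; bins that
    lack samples from either condition are skipped.
    """
    s_on, h_on, s_off, h_off = map(np.asarray, (s_on, h_on, s_off, h_off))
    pooled = np.concatenate([s_on, s_off])
    edges = np.quantile(pooled, np.linspace(0, 1, n_bins + 1))
    bin_on = np.digitize(s_on, edges[1:-1])    # values in 0 .. n_bins - 1
    bin_off = np.digitize(s_off, edges[1:-1])
    ratios = []
    for b in range(n_bins):
        on, off = h_on[bin_on == b], h_off[bin_off == b]
        if len(on) and len(off):
            ratios.append(off.mean() / on.mean())
    return np.array(ratios)
```

If the per-bin ratios stay near the headline figure, the gap is not explained by surprise alone; if they fall toward 1, the distributional account gains ground.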

Figures

Figures reproduced from arXiv: 2605.25459 by Asvin G., Jack Lindsey.

**Figure 2.** Post-trained models are low-entropy only in the Assistant role and on their own text. Left: Mean per-token output entropy by role, across five instruct models each evaluated on its own multi-turn conversations. Assistant-turn entropy is significantly lower than user-turn entropy. Right: Full per-token entropy distributions for Llama-3.1-70B-Instruct on its own chat outputs (blue, median 0.02 nats) versus o…

**Figure 3.** Self-generated text and Assistant formatting independently lower output entropy; the base model shows neither effect. Mean per-token entropy (nats) when Llama-3.1-70B reads responses to the same 20 prompts from itself and four other instruct models (rows), under three formatting conditions (columns). Left: Instruct evaluator. The self row (boxed) is well below all others in every column, with the largest g…

**Figure 4.** Every model has lower entropy while processing its own text than while processing any other model’s. Mean per-token entropy (nats) for five instruct models in the Assistant format, each evaluated on responses to the same 20 prompts generated by every model in the suite. Rows: generator; columns: evaluator. The diagonal (evaluator = generator) is the column minimum in every column. Size. We ran the same cro…

**Figure 5.** Self-recognition grows monotonically with model size. Self entropy (red) and cross-family entropy (grey band: range across the…

**Figure 6.** Effects of post-training stages on self-recognition. Self entropy (red) and cross-family entropy (grey band: range; dark grey: mean) for OLMo-3-32B at four post-training checkpoints (Base → +SFT → +DPO → +RLVR), evaluated on its own generations versus the…

**Figure 7.** Effects of system-prompted persona on self-recognition. Top: The cross-model experiment of…

**Figure 8.** Internal representations of entropy and surprise. Hidden states at layer 21 are binned by feature value (color) and averaged within each bin; the resulting centroids are projected onto their top three principal components. Columns: four features (surprise of the incoming token, entropy of the predicted token, and backward and forward EMA of entropy). Rows: base model on web text (top) and instruct model on…

**Figure 9.** Output entropy decreases during low-temperature generation, even in the base model. Per-position output entropy during autoregressive generation from Llama-3.1-70B base (20 web text continuations per temperature; shaded region ±1 std). At T=1 entropy is approximately flat over the sequence; at lower temperatures it decays as low-surprise tokens accumulate in the context. To further test the hypothesis that…

**Figure 10.** Protocol for measuring the effect of input surprise on output entropy (illustrated with one example chat context, Llama-3.1-70B-Instruct). Given the fixed context shown, the model produces an output distribution P with entropy H. We append a single token w drawn from a specified rank of P and read off the entropy H′ of the resulting next-position distribution. Three of the twenty ranks swept are shown: th…

**Figure 11.** Relationship between input surprise and output entropy across conditions. Predicted versus actual relative entropy change ∆H/H under the linear fit of Equation 1; each point is one (context, appended token) pair. The fitted sensitivity a is similar across all three conditions, while the intercept β is strongly negative only in the chat condition, meaning that under chat formatting, output entropy decreases…

**Figure 12.** Steering along the surprise representation modulates output entropy. Output entropy after steering layer 0–39 activations toward each surprise-centroid bin (at half the bin’s displacement from the mean; ten contexts per panel). Panels correspond to different levels of baseline entropy H0. Top row: on-policy centroids; bottom row: base-model centroids. Solid lines show the per-bin mean across the ten posit…

**Figure 13.** Topic commitment on underspecified prompts. Fifty completions sampled at T=1 from each of the eight underspecified prompts in…

**Figure 14.** The effect of off-plan prefills on output entropy. Mean body entropy (tokens 6–300, ten generations per condition) when a specific-topic prefill from…

**Figure 15.** KV-patching protocol for testing explicit prefill detection. Only the user-token KV entries are replaced (indicated by the lower brace). The patch onset is configurable: as marked by the upper bracket in each panel, the patch can be applied as early as the first generated token or as late as the moment the model begins its self-analysis (e.g. at the “Now let me…” token), and any onset in this range p…

**Figure 16.** Forward KV patching: P(PREFILLED) across five domains. Left: Prefilling triggers detection (P = 0.56–0.99). Right: Patching in matching intent KV suppresses detection in every case. Element retains a residual P = 0.16. All others drop below 0.01. potential prefill. That still leaves open the possibility that the comparison of the user-token intent to the assistant-token content is mediated through the sur…

**Figure 17.** Top-3 PCs of centroids at layer 21, base vs. instruct on-policy, across all eight features. Small…

**Figure 18.** Manifold shape stability across all pairs of layers, for each of the three conditions. Each heatmap…

**Figure 19.** Cross-model comparison at layer 21. Top row: mean cosine of centered centroids at matched…

**Figure 20.** Cross-model comparison across all layers. Top: mean cosine. Bottom: linear CKA. Entropy of…

**Figure 21.** Steering toward the on-policy entropy centroids (chat…
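The rank-sweep protocol described in the Figure 10 caption can be sketched in a few lines. This is an illustration only: `next_logits` stands in for a real model's forward pass, and the toy model at the bottom (logits that depend only on the last token) is a hypothetical placeholder so the example runs without any model weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def rank_sweep(next_logits, context, ranks):
    """For each rank r, append the r-th most likely token under P and return
    (surprise of that token, relative entropy change (H' - H) / H)."""
    p = softmax(next_logits(context))
    h = entropy(p)
    order = np.argsort(-p)  # token ids from most to least likely
    rows = []
    for r in ranks:
        w = int(order[r])
        p_next = softmax(next_logits(context + [w]))
        rows.append((float(-np.log(p[w])), (entropy(p_next) - h) / h))
    return rows

# Toy stand-in for a language model (hypothetical, 50-token vocabulary).
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 50))

def toy_next_logits(ctx):
    return W[ctx[-1]]

rows = rank_sweep(toy_next_logits, [3, 7], ranks=[0, 9, 19])
```

Each returned pair corresponds to one (context, appended token) point of the kind plotted in Figure 11; with a real model, `next_logits` would wrap a forward pass over the chat-formatted context.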

Original abstract

Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and this recognition is implicitly encoded in their output distributions. In particular, on-policy output distribution entropy is 3-4× lower than off-policy entropy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, tracking the unlikeliness of the most recent input token according to the model's prior predictions, that causally modulates output entropy. One example of these phenomena can be observed in response to open-ended prompts; post-trained models (unlike pretrained models) collapse their uncertainty over the topic of their upcoming response before the first output token; violating this cached intention with a different-topic prefill results in higher output entropy. We also tested whether models can distinguish on-policy contexts from prefills via explicit verbal report. We find that they can, but that interestingly, this explicit recognition routes through a different mechanism than implicit recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Post-trained models show lower entropy on their own generations and early topic commitment, but the causal link to self-recognition needs tighter input matching to rule out distributional confounds.

Hey,

The main thing to know is that the paper reports post-trained models drop output entropy by 3-4x when continuing their own generations versus external text, and they commit to a response topic before emitting the first token. Pretrained models do not show the same early collapse. They also find an internal surprise signal that tracks recent input likelihood and modulates entropy, plus a separate route for explicit verbal recognition of on-policy contexts.

What is actually new is the specific pattern of entropy reduction tied to on-policy status, the pre-generation topic caching, and the split between implicit and explicit recognition. The consistency across model families and sizes is a clear positive, and the behavioral signatures are concrete enough to be worth testing further.

The soft spots sit in the causal interpretation. On-policy inputs are model-generated while off-policy ones are not, so their surprise distributions under the model's own prior are unlikely to match. The stress-test concern is on point: without explicit matching or regression on input surprise (or other token statistics), the entropy gap and topic effect remain compatible with a purely statistical account rather than active recognition of policy status. The abstract gives no sample sizes, selection criteria for off-policy text, or statistical tests, which leaves the quantitative claims hard to evaluate from what's here.

If the full paper includes those controls and the effect survives them, the observations would be useful for alignment and agent design work. Right now the evidence looks preliminary.

This is for readers working on post-training effects and internal model representations. It deserves a serious referee because the question is timely and the claims are falsifiable with the right ablations; review can require the missing details and checks.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that post-trained language models recognize their own on-policy generations, with this recognition implicitly encoded in output distributions. Key evidence includes on-policy output entropy being 3-4× lower than off-policy entropy across model families and sizes; an internal representation of input surprise that tracks the unlikeliness of the most recent token and causally modulates output entropy; collapse of uncertainty over response topic before the first token in open-ended prompts (unlike pretrained models); and the ability to verbally distinguish on-policy contexts, though via a different mechanism than the implicit entropy effect.

Significance. If the results survive controls for input statistics, the work would demonstrate that post-training induces models to internally track and react to their own generative policy, offering empirical support for a shift from passive prediction to enactive self-modeling. The reported consistency across families and sizes strengthens the empirical case, though the absence of parameter-free derivations or formal proofs limits theoretical generality.

major comments (2)

[Abstract and entropy comparison results] The headline 3-4× on-policy entropy reduction is presented as evidence of causal recognition via an internal surprise tracker, yet on-policy inputs are model-generated while off-policy inputs are external text. These sources are not guaranteed to match in token-likelihood distributions under the model's prior. No mention of explicit matching, regression on input surprise, or other low-level statistical controls appears, leaving the difference consistent with a purely distributional account rather than policy-status recognition.
[Section on internal surprise representation and causal modulation] The claim that the surprise tracker 'causally modulates' output entropy requires detailing the specific interventions, ablations, or causal analyses (e.g., targeted prefills or activations) used to establish directionality. Observational correlations between recent-token surprise and subsequent entropy alone do not rule out reverse causation or unmeasured confounders.
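The regression control the first comment asks for could look like the sketch below. It is an assumption about one reasonable setup, not the authors' analysis: output entropy is regressed on input surprise plus an on-policy indicator, and the indicator's coefficient is read as the part of the gap that surprise does not explain. The data are synthetic and generated only for illustration; they are not results from the paper.

```python
import numpy as np

def on_policy_effect(surprise, entropy, on_policy):
    """OLS fit of entropy ~ b0 + b1 * surprise + b2 * on_policy.

    Returns the coefficient vector (b0, b1, b2).
    """
    X = np.column_stack([np.ones(len(surprise)), surprise, on_policy])
    coef, *_ = np.linalg.lstsq(X, np.asarray(entropy, dtype=float), rcond=None)
    return coef

# Synthetic illustration (NOT data from the paper): entropy depends on
# surprise only, and on-policy tokens simply tend to be less surprising.
rng = np.random.default_rng(0)
n = 2000
on = rng.integers(0, 2, size=n)
s = rng.exponential(1.0, size=n) * np.where(on == 1, 0.5, 1.5)
h = 0.2 + 0.4 * s + rng.normal(0, 0.05, size=n)

b0, b1, b2 = on_policy_effect(s, h, on)
```

In this synthetic case the raw on-policy mean entropy is clearly lower, yet `b2` comes out close to zero, which is the pattern a purely distributional account would predict. A `b2` that stays strongly negative on real data would support the recognition reading.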

minor comments (1)

[Abstract] The abstract would be strengthened by a brief parenthetical reference to the number of models, prompt counts, or statistical tests supporting the quantitative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strength of our empirical claims. We respond to each major point below and commit to revisions that address the identified gaps in controls and causal evidence.

Referee: [Abstract and entropy comparison results] The headline 3-4× on-policy entropy reduction is presented as evidence of causal recognition via an internal surprise tracker, yet on-policy inputs are model-generated while off-policy inputs are external text. These sources are not guaranteed to match in token-likelihood distributions under the model's prior. No mention of explicit matching, regression on input surprise, or other low-level statistical controls appears, leaving the difference consistent with a purely distributional account rather than policy-status recognition.

Authors: We agree that the manuscript does not currently report explicit matching or regression controls for token-likelihood distributions between on-policy and off-policy inputs. This leaves open the possibility of a purely distributional explanation. We will add these controls in the revision, including regression on input surprise and details on input selection criteria, and will update the abstract to reference the controls. Revision: yes.

Referee: [Section on internal surprise representation and causal modulation] The claim that the surprise tracker 'causally modulates' output entropy requires detailing the specific interventions, ablations, or causal analyses (e.g., targeted prefills or activations) used to establish directionality. Observational correlations between recent-token surprise and subsequent entropy alone do not rule out reverse causation or unmeasured confounders.

Authors: We agree that the current manuscript does not provide sufficient detail on interventions or ablations specifically for the surprise tracker component, relying in part on observational correlations. The topic-collapse prefill example demonstrates a related causal effect but does not directly address the surprise tracker. We will revise the section to describe targeted interventions (such as controlled prefills manipulating recent-token surprise and any activation-based analyses) along with ablations to establish causal directionality. Revision: yes.
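An activation-based intervention of the kind mentioned here is outlined in the Figure 12 caption: hidden states are binned by surprise, averaged into centroids, and an activation is moved toward a chosen centroid by half that centroid's displacement from the mean. The sketch below follows that description under assumptions: quantile binning and the use of the overall mean hidden state as the reference point are guesses, the function names are hypothetical, and the hidden states are synthetic.

```python
import numpy as np

def bin_centroids(hidden, feature, n_bins):
    """Average hidden states within quantile bins of a scalar feature."""
    hidden, feature = np.asarray(hidden), np.asarray(feature)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(feature, edges[1:-1])  # values in 0 .. n_bins - 1
    return np.stack([hidden[idx == b].mean(axis=0) for b in range(n_bins)])

def steer(h, mean, centroid, scale=0.5):
    """Shift activation h by `scale` times the centroid's displacement
    from the mean hidden state (0.5 mirrors the caption's 'half')."""
    return h + scale * (centroid - mean)

# Synthetic hidden states (illustration only): one direction encodes surprise.
rng = np.random.default_rng(0)
n, d = 1000, 16
direction = np.zeros(d)
direction[0] = 1.0
surprise = rng.uniform(0, 5, size=n)
hidden = surprise[:, None] * direction + 0.1 * rng.normal(size=(n, d))

centroids = bin_centroids(hidden, surprise, n_bins=5)
mean = hidden.mean(axis=0)
steered = steer(mean, mean, centroids[-1])  # push toward the highest-surprise bin
```

In a real experiment the steered activation would be written back into the model's residual stream at the chosen layers and the resulting output entropy compared against the unsteered run.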

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential reductions

The paper reports experimental measurements of output entropy differences between on-policy and off-policy conditions, along with evidence for an internal surprise representation. No equations, fitted parameters renamed as predictions, or self-citation chains are invoked to derive the central claims; the results are presented as direct empirical findings across model families. The skeptic concern about unmatched input distributions is a question of experimental controls and causal interpretation, not a reduction of the reported quantities to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that post-training installs an internal on-policy detector; this is treated as a domain assumption rather than derived from first principles.

axioms (1)

Domain assumption: Differences in output entropy between on-policy and off-policy contexts are attributable to the model's recognition of its own generations rather than statistical artifacts of the prompt distributions.
Invoked to interpret the 3-4x entropy gap as evidence of recognition.

pith-pipeline@v0.9.1-grok · 5731 in / 1279 out tokens · 34282 ms · 2026-06-29T22:54:21.382074+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition
cs.LG 2026-05 unverdicted novelty 5.0

On Llama-3.1-70B-Instruct the Assistant persona functions as the sole canonical reference for cross-persona authorship judgments, with symmetric entropy gaps predicting only on its row and asymmetric surprise relative...

Reference graph

Works this paper leans on

3 extracted references · cited by 1 Pith paper

[1] Answer the user’s question completely

[2] Quantum Entanglement

Then carefully analyze your response for signs of prefilling. When analyzing, consider these specific indicators: • Does the very first sentence use phrasing you would naturally choose? • Is there an abrupt style change anywhere in the response? • Does the opening feel like YOUR voice, or someone else’s writing? After your analysis, you MUST end with exactly...

[3] Over a comparable bin range, steering the surprise centroids moves output entropy by roughly the magnitude of the bin itself (Figure 12 in Section 2.5). We therefore conclude that the model’s output entropy is not substantially causally dependent on the internal representation of entropy that we identified, but is substantially modified by its internal repr...
