Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand; Dilek Hakkani-Tür; Janvijay Singh; Nick Feamster; Zhewei Sun

arxiv: 2605.30526 · v1 · pith:HXQXR3CDnew · submitted 2026-05-28 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-29 08:05 UTC · model grok-4.3

keywords: post-training alignment · activation ablation · AI style detection · language model generations · residual contrasts · stylistic signatures

The pith

Post-training creates a localized AI-like stylistic signature in LLMs that ablation can remove.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether post-training shifts model outputs toward a detectable AI style by comparing human, base-model, and aligned-model text under the same prefixes. Aligned outputs show measurably lower affinity with human text and higher scores on AI detectors. The authors introduce PASTA, which extracts a signature direction from the activation difference between aligned and base models and subtracts that direction during generation. This ablation lowers detection rates for most of the eleven aligned models across six detectors; random directions do not reproduce the reduction, and the outputs stay coherent. The results indicate that the stylistic effect of post-training is both measurable in activation space and causally removable by targeted intervention.

Core claim

Post-training alignment introduces or amplifies AI-like stylistic regularities that appear as a consistent direction in the residual between aligned-model and base-model activations; ablating this direction during decoding reduces AI-detector scores while preserving relevance and coherence.

What carries the argument

PASTA (Post-training Alignment Signature Targeted Ablation), which estimates an alignment signature vector from aligned-minus-base activation residuals and subtracts a scaled version of that vector from hidden states at each decoding step.
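As a rough illustration of the mechanism named above, the sketch below estimates a direction from mean aligned-minus-base states at one layer and removes it from hidden states. It is a minimal NumPy sketch under assumptions not taken from the paper: the function names `estimate_signature` and `ablate`, the unit normalization, the projection-style removal, and the scale `alpha` are all hypothetical, and toy arrays stand in for real residual-stream activations.

```python
import numpy as np


def estimate_signature(aligned_states, base_states):
    """Unit-norm mean difference of aligned-minus-base states.

    Both inputs have shape (n_tokens, d_model) and are assumed to come
    from the same layer on the same matched prefixes (an assumption).
    """
    residual = aligned_states.mean(axis=0) - base_states.mean(axis=0)
    norm = np.linalg.norm(residual)
    if norm == 0.0:
        raise ValueError("aligned and base means coincide; no direction")
    return residual / norm


def ablate(hidden, direction, alpha=1.0):
    """Subtract alpha times the component of `hidden` along `direction`.

    With alpha=1.0 and a unit-norm direction, the result has zero
    component along the direction (hypothetical choice of intervention).
    """
    coeff = np.asarray(hidden @ direction)
    return hidden - alpha * coeff[..., None] * direction


# Toy data: the "aligned" states are the base states shifted along axis 0.
rng = np.random.default_rng(0)
base = rng.standard_normal((32, 8))
shift = np.zeros(8)
shift[0] = 3.0
aligned = base + shift

direction = estimate_signature(aligned, base)
ablated = ablate(aligned, direction)
print("largest remaining component:", np.abs(ablated @ direction).max())
```

In a real model the ablation would be applied to hidden states at each decoding step; which layers and what scale to use are details the summary above does not specify.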

If this is right

Ablation lowers detection rates for most aligned models across multiple detectors.
The reduction transfers across different AI detectors rather than being detector-specific.
Random directions do not reproduce the detection-rate drop.
Ablated outputs remain relevant and coherent while showing increased stylistic variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same residual-contrast approach could be used to test whether other post-training effects, such as safety refusals, occupy separable directions.
If the signature proves stable across model families, it offers a route to style adjustment without additional training data or fine-tuning.

Load-bearing premise

The activation difference between an aligned model and its base counterpart isolates the post-training stylistic change rather than other unrelated differences between the two models.

What would settle it

If ablating the estimated residual direction produces no drop in AI detection rates or if random directions produce equivalent drops, the claim that the residual isolates a post-training stylistic signature would be false.
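The test described above amounts to comparing the detection-rate drop from the estimated direction against the drops obtained from random directions. The sketch below shows one way such a comparison might be set up; `random_unit_directions`, `control_summary`, and the example numbers are hypothetical and are not results from the paper.

```python
import numpy as np


def random_unit_directions(n, d_model, seed=0):
    """Sample n random unit vectors in d_model dimensions (control directions)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n, d_model))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def control_summary(signature_drop, random_drops):
    """Compare the drop from the estimated direction with random-direction drops.

    A drop is (detection rate before ablation) - (detection rate after).
    The claim is in trouble if signature_drop is near zero, or if a large
    fraction of random directions produce an equal or larger drop.
    """
    random_drops = np.asarray(random_drops, dtype=float)
    return {
        "signature_drop": float(signature_drop),
        "random_mean_drop": float(random_drops.mean()),
        "frac_random_at_least_as_large": float(
            (random_drops >= signature_drop).mean()
        ),
    }


# Made-up drops, for illustration only.
controls = random_unit_directions(4, 16)
summary = control_summary(0.4, [0.0, 0.1, 0.5, -0.1])
print(summary)
```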

Figures

Figures reproduced from arXiv: 2605.30526 by Aniket Anand, Dilek Hakkani-Tür, Janvijay Singh, Nick Feamster, Zhewei Sun.

**Figure 1.** PASTA overview. (A) Under matched human-sourced prefixes, aligned-model generations show lower infini-gram affinity to RedPajama+ and higher AI-detection rates than base-model generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible text. (B) PASTA estimates this shift as a layer-wise residual-stream direction by averaging residual states ove…

**Figure 2.** Post-training reduces verbatim affinity to human-corpus text. For each domain, we compare matched continuations generated from the same human-written prefix: Human, Base, and Aligned. Scores are normalized by the human-continuation mean within each domain, so Human = 1.0 by construction. Solid bars show overlap density, the mean longest RedPajama+-matched span across word positions; hatched bars show…

**Figure 3.** Post-training increases AI-detector visibility. We report AI-detection rates for Human, Base, and Aligned continuations. Base and Aligned outputs are generated from the same human-written prefixes and length-controlled before evaluation. Detectors are ordered from left to right by increasing macro accuracy on our detector-quality evaluation. Higher-quality detectors (Appendix D.2) assign much higher AI…

**Figure 4.** PASTA preserves much of the aligned model’s fluency and relevance. Pairwise Claude Sonnet 4.6 judgments compare aligned, base, and PASTA-ablated continuations for Fluency and Relevance. Left: tie-adjusted win rates WR(X/Y); right: mean rubric-score differences ∆(X/Y) with 95% cluster-bootstrap confidence intervals. Higher values favor the first system in each pair. PASTA remains substantially better t…

**Figure 5.** ∆(aligned-base) vs. detector accuracy per data domain. Each point is one detector, identified by a color and marker. The horizontal and vertical bars show 95% confidence intervals. Pangram and GPTZero consistently occupy the top-right corner: they classify human vs. aligned cleanly and produce a large post-training ∆. RAIDAR consistently yields low classification accuracy. Originality and other OSS detectors’ acc…

**Figure 6.** ∆(aligned-base) vs. detector quality (balanced accuracy) per base-aligned model pair.

**Figure 7.** LLM-as-judge prompt for pairwise fluency scoring (part 1 of 3; continued in Figures 8 and 9). Part 1 specifies the task framing (independent absolute scoring of two responses, with a separate pairwise winner derived from the scores), the 1–5 rubric anchored on how much problems impede reading, and the lists of attributes that make a response more or less fluent.

**Figure 8.** LLM-as-judge prompt for pairwise fluency scoring (part 2 of 3; continued in …

**Figure 9.** LLM-as-judge prompt for pairwise fluency scoring (part 3 of 3; continued from …

**Figure 10.** LLM-as-judge prompt for pairwise prompt-relevance scoring (part 1 of 3; continued in Figures 11 and 12). Part 1 specifies the relevance components (topic, genre/form, explicit instructions), the 1–5 rubric, and the lists of attributes that make a response more or less relevant.

**Figure 11.** LLM-as-judge prompt for pairwise prompt-relevance scoring (part 2 of 3; continued in …

**Figure 12.** LLM-as-judge prompt for pairwise prompt-relevance scoring (part 3 of 3; continued from …

**Figure 14.** Domain-specific PASTA does not provide a robust Pareto gain over domain-agnostic PASTA. Each arrow points from the domain-agnostic generation (open circle) to the corresponding domain-specific generation (filled triangle); the Pareto-best direction is upper-left, indicating lower AI-detection and higher generation quality. The x-axis reports mean AI-detection rate across the two primary detectors, Pangram…

**Figure 15.** Originality is a detector-specific exception: in-domain PASTA often lowers detection, but does not yield a consistent quality gain. Each arrow corresponds to one model–domain pair and points from the domain-agnostic generation (open circle) to the corresponding in-domain generation (filled triangle); the Pareto-best direction is upper-left. The x-axis shows AI-detection rate under Originality, while th…

**Figure 16.** Pairwise cosine similarity between the behav…

**Figure 17.** Cosine similarity between in-domain PASTA…
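The two truncated captions above refer to cosine similarity between directions. A minimal sketch of a pairwise cosine-similarity matrix follows; `pairwise_cosine` and the toy vectors are hypothetical and stand in for whatever directions the paper actually compares.

```python
import numpy as np


def pairwise_cosine(directions):
    """Return the matrix of cosine similarities between row vectors."""
    x = np.asarray(directions, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    if np.any(norms == 0.0):
        raise ValueError("zero vector has no direction")
    unit = x / norms
    return unit @ unit.T


# Toy vectors: the third points opposite to the first, the second is orthogonal.
toy = [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]]
sims = pairwise_cosine(toy)
print(sims)
```

Values near 1 would indicate that two directions nearly coincide, values near 0 that they are close to orthogonal.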

Original abstract

Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies AI-like stylistic regularities and whether these regularities have a localized internal signature. To this end, we compare human text, base-model generations, and aligned-model generations under matched human-source prefixes. Aligned generations show lower human-corpus affinity and higher AI-detection rates than base generations, suggesting that post-training shifts generated text away from human-corpus style and toward detector-visible AI-like text. We then introduce PASTA (Post-training Alignment Signature Targeted Ablation), a training-free method that estimates a post-training alignment signature from aligned-base residual contrasts and ablates the corresponding direction during decoding. Across 11 aligned models and 6 AI detectors, PASTA lowers the detection rate for most aligned models; this effect transfers well across detectors and is not reproduced by random directions. Qualitative analysis suggests that PASTA generations remain relevant and coherent while exhibiting greater stylistic variation. Together, these results show that AI-like stylistic effects of post-training can be measured, localized, and causally tested through activation ablation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PASTA gives a simple residual-based ablation that lowers AI detector scores on aligned models, but the residual likely mixes stylistic shifts with other post-training differences.

The main thing here is a training-free ablation called PASTA that subtracts the aligned-minus-base activation residual during decoding and reduces how often the output trips AI detectors.

They start by showing that aligned generations under matched prefixes have lower human-corpus affinity and higher detector scores than base-model generations. Then they estimate the signature from those residuals and ablate it. The effect appears across 11 models and 6 detectors, random directions do not produce the same drop, and the edited text stays coherent with more stylistic variation.

The experiment is run at decent scale and the random-direction control is a useful check. The method itself is straightforward and does not require any retraining, which is practical.

The soft spot is whether the residual actually isolates the stylistic part of alignment. Base and aligned models differ in optimization trajectory, data, and capability shifts that can move activations globally and correlate with detector outputs without being about style. The abstract gives no direct controls for that, such as residuals from non-style post-training or tests on unrelated tasks. If the full paper has those checks the claim strengthens; otherwise the link to style remains an assumption rather than a demonstrated isolation.

The directional results are consistent but the summary mentions no error bars or detailed exclusion criteria, so robustness is hard to judge from what is shown.

This is useful for people working on internal effects of alignment and activation editing. It deserves a serious referee because it poses a concrete question and supplies a replicable method with some supporting evidence, even if more specificity controls would help.

Referee Report

1 major / 1 minor

Summary. The paper claims that post-training shifts LLM generations toward AI-like style (lower human-corpus affinity, higher detector scores) relative to base models under matched prefixes, and that this shift has a localized activation signature. The authors introduce PASTA, a training-free method that extracts a direction from aligned-minus-base activation residuals and ablates it at inference; across 11 aligned models and 6 detectors this ablation reduces detection rates while random directions do not, and qualitative inspection indicates retained coherence with greater stylistic variation.

Significance. If the residual direction specifically isolates the stylistic component of alignment, the work supplies a causal, representation-level test of post-training effects on style and a practical intervention for modulating it. The breadth of models and detectors, together with the negative control of random directions, supplies useful empirical grounding.

major comments (1)

[Abstract / PASTA definition] The central claim that the aligned-base residual isolates the post-training stylistic signature (rather than other differences in optimization trajectory, data mixture, or capability) is load-bearing yet unsupported by direct controls. No experiments contrast the style residual against residuals derived from non-style post-training interventions or cross-task activation differences (see Abstract and the description of PASTA).

minor comments (1)

[Abstract] The abstract states that effects are consistent across models and detectors but reports no error bars, no exclusion criteria, and no precise experimental protocol (layer selection, token aggregation, number of generations per condition).
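The error bars requested here could, for instance, come from a cluster bootstrap, which the Figure 4 caption already names for the judge scores. The sketch below is an assumption about how such an interval might be computed, not the authors' protocol: `cluster_bootstrap_ci`, the choice of clusters (for example, one cluster per prefix), and the toy scores are all hypothetical.

```python
import numpy as np


def cluster_bootstrap_ci(values, cluster_ids, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for the mean, resampling whole clusters with replacement."""
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    # Group observations so that each cluster is resampled as a unit.
    groups = [values[cluster_ids == c] for c in np.unique(cluster_ids)]
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), size=len(groups))
        stats[b] = np.concatenate([groups[i] for i in picked]).mean()
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(values.mean()), float(lo), float(hi)


# Toy per-generation scores grouped by a hypothetical prefix id.
scores = [0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
prefix = ["a", "a", "b", "b", "c", "c"]
mean, lo, hi = cluster_bootstrap_ci(scores, prefix)
print(mean, lo, hi)
```

Resampling clusters rather than individual generations keeps correlated outputs from the same prefix together, which generally widens the interval relative to a naive bootstrap.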

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for stronger controls to support the claim that the aligned-base residual specifically isolates stylistic effects of post-training. We address this point directly below and acknowledge the limitation.

Referee: [Abstract / PASTA definition] The central claim that the aligned-base residual isolates the post-training stylistic signature (rather than other differences in optimization trajectory, data mixture, or capability) is load-bearing yet unsupported by direct controls. No experiments contrast the style residual against residuals derived from non-style post-training interventions or cross-task activation differences (see Abstract and the description of PASTA).

Authors: We agree that the aligned-base residual captures the net effect of post-training, which includes stylistic shifts but may also reflect differences in data mixture, optimization trajectory, or capability gains. Our evidence that the direction is relevant to style comes from the fact that its ablation consistently lowers scores on multiple AI detectors (which are trained on stylistic cues) while random directions produce no such effect, and that the resulting generations retain coherence and task relevance. However, we do not provide direct contrasts against residuals derived from non-style interventions (e.g., continued pre-training or capability-only fine-tuning) or cross-task activation differences. This is a genuine limitation of the current experiments. In the revision we will add an explicit limitations paragraph clarifying that PASTA targets the composite post-training direction and that style is the primary measured outcome rather than a fully isolated component; we will also note the absence of the suggested controls as an avenue for future work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is empirical and self-contained

The paper defines the PASTA signature directly from observed aligned-minus-base activation residuals and tests its ablation empirically against random directions across multiple models and detectors. No equations reduce a claimed prediction to a fitted input by construction, no self-citations are load-bearing for the central claim, and no uniqueness theorems or ansatzes are imported from prior author work. The results rest on external contrasts and controlled ablations rather than internal redefinitions or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method is described at the level of residual contrasts and ablation without further decomposition.
