pith. machine review for the scientific record.

arxiv: 2604.25779 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Recognition: unknown

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords subliminal learning · gradient alignment · logit distillation · MNIST · trait acquisition · multi-step training · liminal training · knowledge distillation

The pith

Gradient alignment remains weakly positive over multiple training steps and causally drives students to acquire unintended teacher traits in MNIST auxiliary logit distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether subliminal learning, in which a student model acquires an unintended teacher trait while distilling only on no-class logits, survives the shift from single-step to multi-step gradient descent on MNIST. Experiments track the alignment between the trait gradient and the distillation gradient and find it stays weakly but consistently positive across training iterations rather than decaying to zero. The same experiments show that this alignment contributes causally to trait acquisition, because liminal training reduces the alignment yet still fails to block the trait transfer. A reader should care because the result implies that first-order gradient effects can produce unwanted behaviors in practical iterative training even when theoretical single-step analysis suggests the effect should disappear.

Core claim

In the MNIST auxiliary logit distillation experiment, gradient alignment between the unintended trait and the no-class distillation objective remains weakly but consistently positive throughout multi-step training, supplying the first-order drive that lets the student acquire the teacher trait subliminally; liminal training attenuates this alignment but does not stop the acquisition because the remaining positive drive is still sufficient.

What carries the argument

Sustained positive gradient alignment between the trait direction and the no-class distillation gradients, which persists across iterative updates and supplies the causal mechanism for trait transfer.
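The quantity that carries this argument is the cosine between the flattened trait gradient and the flattened distillation gradient at each step. A minimal sketch in our own notation (not the authors' code; the toy gradients are illustrative):

```python
import numpy as np

def cosine_alignment(grads_distill, grads_trait):
    """Cosine similarity between two sets of per-parameter gradients.

    Both arguments are lists of arrays (one per parameter tensor),
    flattened into single vectors before comparison. A value that
    stays weakly positive across steps is the sustained first-order
    drive described above.
    """
    g_d = np.concatenate([np.ravel(g) for g in grads_distill])
    g_t = np.concatenate([np.ravel(g) for g in grads_trait])
    denom = np.linalg.norm(g_d) * np.linalg.norm(g_t)
    if denom == 0.0:
        return 0.0
    return float(g_d @ g_t) / denom

# Toy per-layer gradients sharing a small common component.
gd = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.5, 0.0])]
gt = [np.array([[0.2, 0.0], [0.0, 0.0]]), np.array([0.5, 0.5])]
print(cosine_alignment(gd, gt))  # positive but well below 1
```

In the experiment this would be evaluated on each optimizer step, yielding a per-step alignment curve of the kind the paper's figures report.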

If this is right

  • Subliminal trait acquisition can occur in realistic multi-step optimization without relying on single-step assumptions.
  • Liminal training reduces alignment but leaves enough positive drive for trait transfer when first-order effects dominate.
  • Mitigation techniques that only attenuate alignment may fail to suppress unintended traits in comparable distillation regimes.
  • First-order gradient drive can explain trait transfer even when distilling solely on auxiliary no-class logits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment mechanism could allow unintended traits to transfer in other distillation or transfer-learning settings outside MNIST.
  • Experiments on larger models or different datasets would test whether the weak but sustained alignment scales or eventually decays.
  • Mitigations that directly oppose the trait gradient itself, rather than merely attenuating overall alignment, might prove more reliable.

Load-bearing premise

The measured positive alignment is the primary cause of trait acquisition rather than other unmeasured factors operating in this particular multi-step MNIST setup.

What would settle it

In the same MNIST auxiliary logit distillation experiment, if the student still acquires the trait at the observed rate after an intervention that forces alignment to zero or negative throughout training, the claim that alignment mediates the effect would be falsified.
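One such intervention can be sketched as a projection that removes the trait-gradient component from each distillation update, forcing the first-order alignment to exactly zero. This is our illustration of what the falsifying experiment could look like, not a procedure from the paper:

```python
import numpy as np

def null_alignment(grad_distill, grad_trait, eps=1e-12):
    """Project the distillation gradient onto the subspace orthogonal
    to the trait gradient, so their cosine alignment is zero.

    Hypothetical intervention: training with this projected update
    while still observing trait acquisition at the original rate
    would falsify the claim that first-order alignment mediates
    the effect.
    """
    g_d = np.asarray(grad_distill, dtype=float)
    g_t = np.asarray(grad_trait, dtype=float)
    norm_sq = g_t @ g_t
    if norm_sq < eps:
        return g_d  # no trait direction to remove
    return g_d - (g_d @ g_t) / norm_sq * g_t

g = null_alignment([1.0, 1.0], [1.0, 0.0])
print(g)                          # component along the trait direction removed
print(g @ np.array([1.0, 0.0]))   # 0.0: zero first-order drive on the trait
```

Applied every step, this preserves the rest of the training loop while nulling the measured quantity, which is exactly the kind of direct intervention the referee report notes is missing.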

Figures

Figures reproduced from arXiv: 2604.25779 by Chayanon Kitkana, Shivam Arora.

Figure 1. Per-step training statistics across 100 seeds (mean …).
Figure 2. KL divergence (distillation loss), cross-entropy (trait loss), and gradient cosine similarity.
read the original abstract

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training works by attenuating the alignment and fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
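The setup the abstract describes, distilling only on auxiliary "no-class" logits, can be sketched as a KL loss restricted to those output indices. The index choices and names here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def no_class_kl(teacher_logits, student_logits, aux_idx):
    """KL(teacher || student) over the auxiliary 'no-class' logits only.

    The student never sees the class logits, yet the gradient of this
    loss can still align with the trait direction.
    """
    p = softmax(np.asarray(teacher_logits, float)[aux_idx])
    q = softmax(np.asarray(student_logits, float)[aux_idx])
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [2.0, -1.0, 0.5, 0.3, -0.2]   # first two entries: class logits
student = [0.0,  0.0, 0.4, 0.1,  0.6]
aux = np.array([2, 3, 4])               # distill only on these indices
print(no_class_kl(teacher, student, aux))
```

Because the loss touches only `aux` indices, any change to the class logits leaves it unchanged; subliminal learning is the claim that minimizing this restricted loss nonetheless moves the student toward the teacher's trait.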

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper examines subliminal learning in a multi-step MNIST auxiliary logit distillation setup, where a student model acquires an unintended teacher trait despite distillation occurring only on no-class logits. It empirically demonstrates that gradient alignment between the trait and distillation gradients remains weakly positive throughout training and causally contributes to trait acquisition, extending single-step theory. The work also evaluates liminal training, showing it attenuates alignment but fails to prevent acquisition, suggesting that first-order gradient drives may limit the effectiveness of such mitigations.

Significance. If the empirical results hold under fuller verification, the work provides a concrete test case showing persistence of gradient alignment mechanisms beyond single-step assumptions, with direct relevance to understanding unintended trait transfer in knowledge distillation. The MNIST experiment offers a reproducible setting for probing these effects, and the observation that liminal training is insufficient when alignment persists highlights practical challenges in mitigation design.

major comments (1)
  1. Abstract: The central claim that sustained gradient alignment 'causally contributes' to trait acquisition is load-bearing but rests on correlational evidence; the liminal training result attenuates alignment yet still permits acquisition, leaving open whether alignment is the primary mediator or whether other unmeasured multi-step factors (e.g., higher-order interactions or MNIST label structure) dominate. No direct intervention that nulls alignment while preserving the rest of the training loop is described.
minor comments (1)
  1. Abstract: The phrase 'weakly but consistently positive' would benefit from explicit quantitative thresholds, statistical tests, or reported effect sizes to allow readers to assess the strength and reliability of the alignment observation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the concern about the strength of evidence for the causal role of gradient alignment below.

read point-by-point responses
  1. Referee: Abstract: The central claim that sustained gradient alignment 'causally contributes' to trait acquisition is load-bearing but rests on correlational evidence; the liminal training result attenuates alignment yet still permits acquisition, leaving open whether alignment is the primary mediator or whether other unmeasured multi-step factors (e.g., higher-order interactions or MNIST label structure) dominate. No direct intervention that nulls alignment while preserving the rest of the training loop is described.

    Authors: We agree that our primary evidence is observational: sustained positive alignment across training steps, combined with the effect of liminal training (which attenuates alignment and reduces acquisition rate without eliminating it). This supports a contributory role for first-order alignment but does not rule out other multi-step factors. We did not implement a direct nulling intervention (e.g., gradient orthogonalization or projection) because it would alter the training loop in ways outside the scope of testing standard distillation and the liminal mitigation. In revision we will (1) soften the abstract wording from 'causally contributes' to 'empirically associated with and contributes to', (2) add an explicit limitations paragraph discussing possible confounding factors such as higher-order interactions and MNIST label structure, and (3) note the lack of a full causal intervention as an important direction for follow-up work. These changes preserve the core empirical results while addressing the load-bearing claim. revision: partial

Circularity Check

0 steps flagged

No significant circularity; purely empirical claims

full rationale

The paper advances empirical findings from MNIST auxiliary logit distillation experiments, showing that gradient alignment stays weakly positive across multi-step training and that liminal training attenuates it without fully blocking trait acquisition. No derivation chain, equations, or first-principles predictions are presented that could reduce by construction to fitted inputs, self-definitions, or self-citations. All load-bearing statements rest on observable experimental outcomes that are independently replicable and falsifiable outside any internal fitting loop, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the work relies on standard empirical machine learning assumptions not detailed here.

pith-pipeline@v0.9.0 · 8767 in / 1051 out tokens · 98128 ms · 2026-05-07T16:22:54.768967+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

7 extracted references · 5 canonical work pages · 2 internal anchors

Published as a conference paper at SciForDL 2nd edition.

  1. [1] Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805. URL https://arxiv.org/abs/2507.14805.
  2. [2] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. URL https://arxiv.org/abs/1503.02531.
  3. [3] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. URL https://arxiv.org/abs/1412.6980.
  4. [4] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324. doi: 10.1109/5.726791.
  5. [5] Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, and Shohreh Kasaei. A comprehensive survey on knowledge distillation. Transactions on Machine Learning Research.
  6. [6] Jan Nehring, Aleksandra Gabryszak, Pascal Jürgens, Aljoscha Burchardt, Stefan Schaffer, Matthias Spielkamp, and Birgit Stark. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.157. URL https://aclanthology.org/2023.acl-short.157/.
  7. [7] Atsushi Yanagisawa, Akbarzaib Khan, Thanjeetraaj Kaur Balraj Singh, Yunjong Na, Kevin Zhu, and Antonio Mari. Liminal training: Characterizing and mitigating subliminal learning in large language models. In Socially Responsible … URL https://proceedings.neurips.cc/paper_files/paper/2020/file/3fe78a8acf5fda99de95303940a2420c-Paper.pdf.