Recognition: 2 theorem links
Understanding intermediate layers using linear classifier probes
Pith reviewed 2026-05-11 18:25 UTC · model grok-4.3
The pith
Linear probes show that feature separability increases monotonically along the depth of neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training linear probes independently on each layer's activations in popular models like Inception v3 and ResNet-50 shows that the probes' classification accuracy increases monotonically with depth. This establishes that the features become progressively more linearly separable for the downstream task.
What carries the argument
The linear classifier probe: a simple linear model trained separately on a layer's activations to quantify the linear separability of features for the target classes.
If this is right
- The deeper layers of the network hold features that are more readily usable by a linear classifier for the task.
- Breaks in the monotonic increase of probe accuracy can indicate locations where the model may have training problems or suboptimal feature extraction.
- The method provides a layer-by-layer view that can inform decisions about network architecture and where to focus debugging efforts.
- Similar probes could be used to study how information is transformed in other deep learning models.
Where Pith is reading between the lines
- If the monotonic increase is general, it would support the view that depth allows for successive refinement of representations.
- The approach could be used to evaluate the quality of individual layers for purposes like model pruning or transfer learning.
- One might investigate whether the same pattern holds when using non-linear probes or in different domains such as natural language processing.
Load-bearing premise
The accuracy achieved by a linear probe trained on a layer's activations is a reliable indicator of how informative and useful those activations are for solving the classification problem.
What would settle it
A counterexample would be a trained neural network in which the accuracy of linear probes trained on deeper layers is lower than on shallower layers, despite the model achieving high overall performance on the task.
read the original abstract
Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increase monotonically along the depth of the model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes training linear classifiers, termed 'probes,' independently on the frozen activations of each layer in a neural network to measure how linearly separable the features are for the target classification task. Applied to Inception-v3 and ResNet-50, the central experimental result is that probe accuracy increases monotonically with network depth; the method is further shown to provide diagnostic value for understanding layer roles and identifying model issues.
Significance. The probe technique supplies a simple, reproducible diagnostic that requires no changes to the original model and yields direct empirical observations about feature evolution across depth. The monotonic separability finding on two standard architectures offers a concrete, falsifiable insight into how task-relevant information accumulates in deep networks. Strengths include the independent training protocol on held-out activations and the absence of post-hoc fitting or circular definitions, making the approach broadly applicable for interpretability studies.
minor comments (3)
- Abstract: the sentence 'the linear separability of features increase monotonically' contains a subject-verb agreement error ('increase' should be 'increases').
- The description of probe training (independent linear classifiers on layer activations) would benefit from an explicit statement of the loss function and optimizer used for the probes, even if standard cross-entropy and SGD are implied.
- Figure captions and axis labels for the accuracy-vs-depth plots should include the number of probe training runs or error bars to convey variability in the reported monotonic trend.
Simulated Author's Rebuttal
We thank the referee for their positive review, accurate summary of the work, and recommendation to accept. The referee correctly identifies the core contribution of the linear probe technique and the monotonic separability observation on Inception-v3 and ResNet-50.
Circularity Check
No significant circularity; experimental measurements are independent
full rationale
The paper presents an empirical method of training linear probes independently on frozen layer activations from held-out data to measure linear separability. The central observation—that probe accuracy increases monotonically with depth—is a direct experimental result on Inception-v3 and ResNet-50 with no equations, fitted parameters, or self-referential definitions that reduce the reported quantities back to the inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the derivation chain. The method is self-contained and externally verifiable via standard supervised training on activations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Linear classifier accuracy on layer activations measures the linear separability of those features for the classification task.
Forward citations
Cited by 58 Pith papers
-
Dissecting Jet-Tagger Through Mechanistic Interpretability
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Do Audio-Visual Large Language Models Really See and Hear?
AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
-
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
Text embeddings in MM-DiTs contain a detectable omission signal for missing concepts, and amplifying it via OSI reduces concept omission in generated images on FLUX.1-Dev and SD3.5-Medium.
-
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
-
Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
-
From Mechanistic to Compositional Interpretability
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
-
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
-
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships ...
-
Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
-
LUMINA: A Grid Foundation Model for Benchmarking AC Optimal Power Flow Surrogate Learning
LUMINA-Bench is a standardized evaluation framework for ACOPF surrogate models that tests generalization across multiple grid topologies using accuracy and physics-constraint metrics.
-
Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations
Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.
-
Knowing when to trust machine-learned interatomic potentials
PROBE recasts MLIP uncertainty quantification as selective classification by training a compact discriminative classifier on frozen per-atom backbone embeddings, yielding a reliability probability that tracks actual e...
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing...
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...
-
Synthetic Designed Experiments for Diagnosing Vision Model Failure
SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
-
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
-
A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup
Proxy rankings of pretraining datasets by learned structure can reverse the actual OOD accuracy rankings in a synthetic sequence modeling task.
-
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
LLMs detect CoT reasoning errors in hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic not causal.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.
-
Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces
Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.
-
The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks
Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
-
Intermediate Representations are Strong AI-Generated Image Detectors
Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
-
Differentiable Kernel Ridge Regression for Deep Learning Pipelines
Sparse Kernels turn kernel ridge regression into end-to-end differentiable PyTorch layers that support training-free transfer, nonlinear probing, and hybrid models while matching or augmenting neural readouts in some ...
-
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
-
Lost in State Space: Probing Frozen Mamba Representations
Frozen Mamba patch-boundary readouts do not outperform mean pooling for sentence representations on SST-2, CoLA, MRPC, STS-B, and IMDb due to anisotropy (cosine similarity ~0.9999) and representational collapse (MCC=0...
-
Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
-
AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification
Transformer-based ReID embeddings encode BMI most strongly in deeper layers, followed by pitch, gender, and yaw, with pose peaking in middle layers and BMI increasing with depth; cross-spectral settings shift reliance...
-
Contextual Linear Activation Steering of Language Models
CLAS dynamically adapts linear activation steering strengths to context, outperforming fixed-strength steering and matching or exceeding ReFT and LoRA on eleven benchmarks across four model families with limited labeled data.
-
MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
-
Class Unlearning via Depth-Aware Removal of Forget-Specific Directions
DAMP performs one-shot class unlearning by extracting and projecting out forget-specific residual directions at each network depth using class prototypes and a separability-derived scaling rule.
-
Preventing Latent Rehearsal Decay in Online Continual SSL with SOLAR
SOLAR prevents latent rehearsal decay in online continual SSL by adaptively managing replay buffers with deviation proxies and an explicit overlap loss, delivering both fast convergence and state-of-the-art final accu...
-
Zero-Shot Synthetic-to-Real Handwritten Text Recognition via Task Analogies
A method learns synthetic-to-real parameter corrections from source languages and transfers them to target languages without any real target data, improving HTR across five languages and six models.
-
Exact Unlearning from Proxies Induces Closeness Guarantees on Approximate Unlearning
Inferring data distributions precisely allows distilling exact unlearning signals, yielding KL divergence bounds to the retrained model and outperforming competitors in three forgetting scenarios.
-
Towards Effective Theory of LLMs: A Representation Learning Approach
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
-
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
-
Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification
Dual-LoRA with a language-anchored adversary achieves 0.91% EER on the TidyVoice benchmark for cross-lingual speaker verification by targeting true linguistic cues while preserving speaker discriminability.
-
Probing for Reading Times
Early layers of language models predict early-pass human reading times better than surprisal, with surprisal superior for late-pass measures and strong variation by language.
-
A Model of Understanding in Deep Learning Systems
Deep learning systems achieve systematic understanding through internal models tracking regularities but exhibit fractured understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.
-
Beyond the Black Box: Interpretability of Agentic AI Tool Use
A mechanistic interpretability toolkit with SAEs and probes enables pre-action inference of tool decisions in AI agents trained on function-calling trajectories.
-
Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning
A zero-hidden-layer Topological Auditor prunes linear shortcuts, forcing networks to higher geometric capacity (N>=16) for fairer representations and cutting counterfactual gender vulnerability from 21.18% to 7.66%.
-
Probing Classifiers: Promises, Shortcomings, and Advances
Probing classifiers are a common but limited method for analyzing linguistic knowledge in neural NLP models, and this review outlines their promises, methodological shortcomings, and recent advances.
-
High-Dimensional Statistics: Reflections on Progress and Open Problems
A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
Reference graph
Works this paper leans on
-
[1]
Understanding intermediate layers using linear classifier probes
Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644\/
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [2]
-
[3]
Bach, S., Binder, A., Montavon, G., Klauschen, F., M \"u ller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one\/ , 10 (7), e0130140
work page 2015
-
[4]
Biggio, B., Corona, I., Maiorca, D., Nelson, B., S rndi \'c , N., Laskov, P., Giacinto, G., and Roli, F. (2013). Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases\/ , pages 387--402. Springer
work page 2013
-
[5]
Binder, A., Montavon, G., Lapuschkin, S., M \"u ller, K.-R., and Samek, W. (2016). Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks\/ , pages 63--71. Springer
work page 2016
-
[6]
Chollet, F. et al. (2015). Keras. https://github.com/fchollet/keras
work page 2015
-
[7]
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning\/ , pages 647--655
work page 2014
-
[8]
Dosovitskiy, A. and Brox, T. (2016). Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\/ , pages 4829--4837
work page 2016
-
[9]
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems\/ , pages 2672--2680
work page 2014
-
[10]
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\/ , pages 770--778
work page 2016
-
[11]
Jarrett, K., Kavukcuoglu, K., Lecun, Y., et al. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision\/ , pages 2146--2153. IEEE
work page 2009
- [12]
-
[13]
Lapuschkin, S., Binder, A., Montavon, G., M \"u ller, K.-R., and Samek, W. (2016). Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\/ , pages 2912--2920
work page 2016
- [14]
-
[15]
Mahendran, A. and Vedaldi, A. (2015). Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition\/ , pages 5188--5196
work page 2015
-
[16]
Mahendran, A. and Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision\/ , 120 (3), 233--255
work page 2016
-
[17]
Montavon, G., Braun, M. L., and M \"u ller, K.-R. (2011). Kernel analysis of deep networks. Journal of Machine Learning Research\/ , 12 (Sep), 2563--2581
work page 2011
-
[18]
Raghu, M., Yosinski, J., and Sohl-Dickstein, J. (2017a). Bottom up or top down? dynamics of deep representations via canonical correlation analysis. arxiv\/
-
[19]
Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. (2017b). Svcca: Singular vector canonical correlation analysis for deep understanding and improvement. arXiv preprint arXiv:1706.05806\/
-
[20]
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge . International Journal of Computer Vision (IJCV)\/ , 115 (3), 211--252
work page 2015
-
[21]
Singh, S., Hoiem, D., and Forsyth, D. (2016). Swapout: Learning an ensemble of deep architectures. In Advances In Neural Information Processing Systems\/ , pages 28--36
work page 2016
-
[22]
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199\/
work page internal anchor Pith review arXiv 2013
-
[23]
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\/ , pages 1--9
work page 2015
-
[24]
Veit, A., Wilber, M. J., and Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems\/ , pages 550--558
work page 2016
-
[25]
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning\/ , pages 2048--2057
work page 2015
-
[26]
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Advances in neural information processing systems\/ , pages 3320--3328
work page 2014
-
[27]
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision\/ , pages 818--833. Springer
work page 2014
-
[28]
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530\/
work page internal anchor Pith review arXiv 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.