
arxiv: 2606.02907 · v2 · pith:UQ525LCXnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

Pith reviewed 2026-06-28 14:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords linear probing · reasoning modes · format confounds · language model hidden states · mechanistic interpretability · deductive inductive abductive

The pith

Linear probes on LLM hidden states separate reasoning types only because of task-format differences such as source dataset, option count, and response length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether linear probes can detect distinct representations for deductive, inductive, and abductive reasoning in a large language model's hidden states. Probes achieve perfect accuracy on three benchmarks, with clear geometric separation. However, subtracting the effects of source identity, option count, and response length drops accuracy to chance. Additional checks show the model uses largely shared reasoning across tasks and that the geometry has no causal connection to reasoning mode.

Core claim

High linear probe accuracy distinguishing deductive, inductive, and abductive reasoning in Qwen3-14B hidden states is driven entirely by format confounds; residualizing source identity, option count, and response length reduces accuracy to chance, while trace-anchor similarity and causal steering indicate shared reasoning mechanisms with no functional link to the observed geometry.

What carries the argument

Residualization of format features (source identity, option count, response length) that removes the apparent separation in hidden-state geometry.
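The residualization step can be sketched on synthetic data. This is a minimal illustration, assuming ordinary least squares with an intercept and a nearest-centroid classifier standing in for the linear probe; the function names, the synthetic "length" confound, and all sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def residualize(H, C):
    """Return H minus its OLS fit on covariates C (with an intercept)."""
    X = np.column_stack([np.ones(len(H)), C])
    beta, *_ = np.linalg.lstsq(X, H, rcond=None)  # one regression per hidden dimension
    return H - X @ beta

def centroid_probe_accuracy(H, y):
    """Fit class centroids on even rows, score nearest-centroid on odd rows."""
    tr, te = slice(0, None, 2), slice(1, None, 2)
    classes = np.unique(y[tr])
    centroids = np.stack([H[tr][y[tr] == c].mean(axis=0) for c in classes])
    d2 = ((H[te][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[d2.argmin(axis=1)] == y[te]).mean())

rng = np.random.default_rng(0)
n_per_mode, dim = 100, 16
y = np.tile([0, 1, 2], n_per_mode)             # three synthetic "modes", interleaved
length = rng.normal(50 + 40 * y, 5.0)          # hypothetical confound: length differs by mode
direction = rng.normal(size=dim)
# Hidden states carry only the length confound plus isotropic noise.
H = 0.1 * np.outer(length, direction) + rng.normal(size=(len(y), dim))

acc_raw = centroid_probe_accuracy(H, y)
acc_res = centroid_probe_accuracy(residualize(H, length[:, None]), y)
print(f"raw: {acc_raw:.2f}  residualized: {acc_res:.2f}")
```

In this toy setup the label is recoverable only through the confound, so accuracy is high before residualization and close to the three-class chance level afterwards. It says nothing about what happens when a genuine signal coexists with the confound.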

If this is right

  • Probe accuracy without format controls overestimates distinct computational structures for different reasoning types.
  • Mechanistic interpretability studies using linear probes on reasoning tasks require routine deconfounding of format variables.
  • Trace-anchor similarity and steering interventions can be used to check whether geometric separation reflects functional differences.
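The steering check in the last bullet depends on comparing a targeted direction against random directions of equal magnitude. The following is a minimal sketch, assuming norm-matched Gaussian controls and a two-sided permutation test on per-run effects. The summary does not state how the paper built its steering vectors or which test produced p = 0.286, so every name and number below is hypothetical.

```python
import math
import random

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def random_control(target, rng):
    """Random Gaussian direction rescaled to the same L2 norm as `target`."""
    g = [rng.gauss(0.0, 1.0) for _ in target]
    scale = norm(target) / norm(g)
    return [scale * x for x in g]

def permutation_pvalue(targeted, control, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in mean effect."""
    rng = random.Random(seed)
    observed = abs(sum(targeted) / len(targeted) - sum(control) / len(control))
    pooled = list(targeted) + list(control)
    k = len(targeted)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

rng = random.Random(0)
targeted = [0.5, -1.0, 2.0, 0.0]        # hypothetical centroid-difference direction
control = random_control(targeted, rng)  # same magnitude, random orientation

# Hypothetical per-run mode-shift effects for targeted and random steering.
p_value = permutation_pvalue([0.02, 0.01, 0.03, 0.00], [0.01, 0.02, 0.00, 0.03])
```

Matching the norm matters because a perturbation can change outputs through magnitude alone; only a difference between targeted and norm-matched random steering speaks to direction.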

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing claims about specialized reasoning circuits in LLMs may need re-examination once format is controlled.
  • Future benchmark design could prioritize matched formats to isolate reasoning signals more cleanly.
  • The result raises the question of whether similar format confounds affect probes for other cognitive distinctions such as factuality or planning.

Load-bearing premise

Residualizing source identity, option count, and response length removes only format information and leaves any genuine reasoning-mode signals intact.
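One way to see why this premise carries weight: in the design described in the abstract, each benchmark supplies exactly one reasoning mode, so source identity and mode label coincide. The toy sketch below assumes source identity enters the regression as a full set of one-hot indicators, in which case the OLS fit is the per-source mean. The one-dimensional feature and its mode-dependent shift are hypothetical, and the paper's exact procedure may differ.

```python
import random
from statistics import mean

def residualize_on_source(values, sources):
    """OLS residuals of a 1-D feature on one-hot source identity.

    With one indicator per source, the fitted value is the per-source mean,
    so each residual is the value minus the mean of its source.
    """
    group_mean = {s: mean(v for v, t in zip(values, sources) if t == s)
                  for s in set(sources)}
    return [v - group_mean[s] for v, s in zip(values, sources)]

def mode_means(values, modes):
    """Mean of the feature within each reasoning mode."""
    return {m: mean(v for v, k in zip(values, modes) if k == m) for m in set(modes)}

random.seed(0)
# Each source carries exactly one mode, as in the benchmarks named in the abstract.
source_to_mode = {"LogiQA": "deductive", "ARC": "inductive", "alphaNLI": "abductive"}
sources = [s for s in source_to_mode for _ in range(100)]
modes = [source_to_mode[s] for s in sources]

# Hypothetical feature whose mean depends on the mode (a stand-in for a genuine signal).
mode_shift = {"deductive": -2.0, "inductive": 0.0, "abductive": 2.0}
feature = [mode_shift[m] + random.gauss(0.0, 1.0) for m in modes]

means_before = mode_means(feature, modes)
means_after = mode_means(residualize_on_source(feature, sources), modes)
```

Under these assumptions the per-mode means are zero after residualization whether the original shift came from format or from reasoning mode, so a drop to chance cannot distinguish the two on its own.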

What would settle it

A new set of benchmarks that match source, option count, and response length across deductive, inductive, and abductive tasks but still yield probe accuracy above chance after residualization.

Figures

Figures reproduced from arXiv: 2606.02907 by Aman Chadha, Divya Chaudhary, Subramanyam Sahoo, Vinija Jain.

Figure 1: Dataset statistics. Accuracy by source dataset, overall model accuracy (86%), and class balance across reasoning modes. The dataset is class-balanced (250 per mode), while source-wise accuracy reveals substantial variation in task difficulty (LogiQA: 73.2%, ARC: 93.6%, αNLI: 91.2%).
Figure 2: Layer-wise probe accuracy. Cross-validated accuracy across network depth peaks at layer 32 with 100% balanced accuracy. Information about reasoning-mode labels is weak in early layers and becomes perfectly separable in late layers.
Figure 3: Manifold geometry at layer 32. (Top-left) UMAP shows three separated clusters. (Top-right) Mode-specific intrinsic dimensionalities differ substantially. (Bottom-left) Curvature distributions differ across modes. (Bottom-right) Compactness and hull contamination quantify clean separation. All of these properties are explained by format confounds (Section 5.2).
Figure 4: Trace-mode agreement. (Left) Projection into the reasoning-mode simplex shows weak clustering by intended mode. (Middle) Agreement between predicted and intended mode is 42.5%, only marginally above the 33.3% chance level. (Right) Dominant-mode scores are broadly distributed, indicating no strong mode preference.
Figure 5: Steering experiments. (Top-left) Accuracy before and after steering. (Top-right) Post-steering mode distribution. (Bottom-left) Coherence sweep for optimal α*. (Bottom-right) Targeted vs. random steering shows no significant difference (p = 0.286).
Figure 6: Geodesic interpolation between reasoning modes. Smooth transitions in style scores along centroid-to-centroid paths in representation space.
Figure 7: Layer-specific causal intervention. Steering at early layers produces larger mode shifts, but effects are not direction-specific (comparable to random perturbations).
Figure 8: Conflict injection. Both targeted and random conflict pairs produce 100% coherence collapse, confirming magnitude-based rather than direction-specific effects.
Figure 9: Pre-output failure prediction. Hidden-state probes at layer 32 achieve competitive failure detection compared to output-confidence baselines.
Original abstract

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\alpha$NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that high-accuracy linear probes on LLM hidden states for deductive (LogiQA 2.0), inductive (ARC-Challenge), and abductive (αNLI) reasoning tasks reflect task format confounds rather than distinct reasoning modes. At layer 32 of Qwen3-14B, probes reach 100% cross-validated accuracy with separated geometry, but residualizing source identity, option count, and response length drops accuracy to chance; trace-anchor similarity shows 42.5% agreement (vs. 33.3% chance) and causal steering with random controls (n=20) finds no functional link (p=0.286). The conclusion is that format deconfounding should be routine in mechanistic interpretability.

Significance. If the result holds after stronger controls, the work is significant because it supplies concrete empirical evidence (residualization, trace-anchor similarity, and random-control steering) that format confounds can fully explain probe separability in reasoning studies. This directly addresses a common methodological risk in the field and supplies a practical recommendation for future work. The multi-control design (rather than reliance on a single baseline) strengthens the contribution.

major comments (2)
  1. [Abstract] The central claim that residualizing only source identity, option count, and response length removes all format-related variance (leaving any genuine reasoning-mode geometry intact) is load-bearing, yet the three benchmarks differ systematically in domain, question phrasing style, and lexical distributions, none of which are residualized. No diagnostic is reported showing that the post-residualization representations are orthogonal to remaining task-identity information.
  2. [Abstract] The manuscript states that residualization reduces accuracy to chance but supplies no implementation details (regression form, per-layer vs. global application), cross-validation procedure for the post-residualization probes, or statistical power analysis, so the reported drop to chance is only partially supported as evidence against reasoning-mode signals.
minor comments (1)
  1. [Abstract] The method used to compute the reported intrinsic dimensionalities (20.6, 28.5, 33.6) and convex-hull contamination values is not stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments highlighting areas where additional controls and details would strengthen the manuscript. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim that residualizing only source identity, option count, and response length removes all format-related variance (leaving any genuine reasoning-mode geometry intact) is load-bearing, yet the three benchmarks differ systematically in domain, question phrasing style, and lexical distributions, none of which are residualized. No diagnostic is reported showing that the post-residualization representations are orthogonal to remaining task-identity information.

    Authors: We agree that domain, phrasing style, and lexical distributions represent additional potential confounds not explicitly residualized. Source identity residualization is intended to capture benchmark-specific format differences that correlate with these factors, but we acknowledge this is indirect. To directly test orthogonality to task identity, we will add a diagnostic in the revision: training linear probes on the residualized representations to predict task label and reporting chance-level performance. This will be included in the Results section alongside the existing residualization results. revision: partial

  2. Referee: [Abstract] The manuscript states that residualization reduces accuracy to chance but supplies no implementation details (regression form, per-layer vs. global application), cross-validation procedure for the post-residualization probes, or statistical power analysis, so the reported drop to chance is only partially supported as evidence against reasoning-mode signals.

    Authors: We agree that these methodological details are required for reproducibility and to fully support the claim. In the revised manuscript we will expand the Methods section to specify: (i) ordinary least-squares linear regression applied independently per layer for residualization, (ii) 5-fold stratified cross-validation for the post-residualization probes, and (iii) a post-hoc power analysis confirming adequate power to detect deviations from chance. These additions will be placed in the main text rather than only the appendix. revision: yes
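The procedure promised in this response (per-layer OLS residualization followed by 5-fold stratified cross-validation) can be sketched for a single layer as follows. This is a hedged illustration, not the authors' code: it assumes the residualizer is fit on training folds only, uses a ridge-regularized least-squares probe on one-hot targets as a stand-in for whatever probe the paper used, and runs on synthetic hidden states with two hypothetical covariates.

```python
import numpy as np

def stratified_folds(y, k=5, seed=0):
    """Assign each sample a fold id so every fold has a similar class mix."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        fold[idx] = np.arange(len(idx)) % k
    return fold

def design(C):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(C)), C])

def fit_probe(H, y, n_classes, ridge=1e-3):
    """Least-squares linear probe on one-hot targets (stand-in for logistic regression)."""
    X = design(H)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ np.eye(n_classes)[y])

def cv_probe_accuracy(H, y, C=None, k=5, seed=0):
    """Stratified k-fold accuracy; if C is given, residualize within each fold."""
    fold = stratified_folds(y, k, seed)
    n_classes = int(y.max()) + 1
    correct = 0
    for f in range(k):
        tr, te = fold != f, fold == f
        H_tr, H_te = H[tr], H[te]
        if C is not None:
            # Fit the OLS residualizer on training rows only, then apply to both splits.
            beta, *_ = np.linalg.lstsq(design(C[tr]), H[tr], rcond=None)
            H_tr = H[tr] - design(C[tr]) @ beta
            H_te = H[te] - design(C[te]) @ beta
        W = fit_probe(H_tr, y[tr], n_classes)
        correct += int(((design(H_te) @ W).argmax(axis=1) == y[te]).sum())
    return correct / len(y)

# Synthetic layer: hidden states depend on two hypothetical format covariates only.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
options = np.array([4.0, 4.0, 2.0])[y]                      # option count per class
length = rng.normal(np.array([120.0, 40.0, 40.0])[y], 5.0)  # response length per class
C = np.column_stack([options, length])
B = rng.normal(size=(2, 16)) * np.array([[1.0], [0.05]])
H = C @ B + rng.normal(size=(len(y), 16))

acc_raw = cv_probe_accuracy(H, y)
acc_res = cv_probe_accuracy(H, y, C=C)
```

Fitting the regression inside each training fold keeps test rows out of the residualizer, which is one reasonable reading of "cross-validation procedure for the post-residualization probes"; fitting it once on all data is the other common choice and should be reported either way.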

Circularity Check

0 steps flagged

No circularity; empirical controls are independent of fitted inputs


The paper presents no derivation chain or first-principles result that reduces to its own inputs. Claims rest on direct measurements: cross-validated probe accuracy before/after residualizing three covariates, intrinsic dimensionality, convex hull overlap, trace-anchor agreement percentages, and p-values from steering experiments with random controls. These are falsifiable empirical quantities computed from model activations and task metadata; none are defined in terms of the target conclusion or obtained by fitting a parameter then relabeling it a prediction. No self-citation is invoked to establish uniqueness or forbid alternatives, and the residualization step is an explicit linear projection whose effect is reported rather than assumed by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim depends on standard assumptions from the probing literature that linear classifiers can isolate representational differences and that the chosen format covariates fully capture presentation confounds without removing reasoning signals.

free parameters (1)
  • probe layer
    Layer 32 of 40 selected because it shows peak accuracy; choice is post-hoc relative to the reported result.
axioms (2)
  • domain assumption: The three benchmarks isolate distinct reasoning modes (deductive, inductive, abductive) independent of format.
    Invoked when interpreting the initial 100% probe accuracy as evidence of reasoning-type separation.
  • domain assumption: Residualization on source identity, option count, and response length removes only format information.
    Central to the claim that the accuracy drop demonstrates format-driven separation.

pith-pipeline@v0.9.1-grok · 5741 in / 1327 out tokens · 41644 ms · 2026-06-28T14:19:38.438046+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 6 linked inside Pith

  1. [1]

    Preprint, arXiv:2602.16763

    When ai benchmarks plateau: A systematic study of benchmark saturation. Preprint, arXiv:2602.16763. Guillaume Alain and Yoshua Bengio

  2. [2]

    Preprint, arXiv:1908.05739

    Abductive commonsense reasoning. Preprint, arXiv:1908.05739. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Win...

  3. [3]

    Preprint, arXiv:2005.14165

    Language models are few-shot learners. Preprint, arXiv:2005.14165. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord

  4. [4]

    Preprint, arXiv:1803.05457

    Think you have solved question answering? try arc, the ai2 reasoning challenge. Preprint, arXiv:1803.05457. Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni

  5. [5]

    Preprint, arXiv:2407.02678

    Reasoning in large language models: A geometric perspective. Preprint, arXiv:2407.02678. Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio

  6. [6]

    Preprint, arXiv:2209.00840

    Folio: Natural language reasoning with first-order logic. Preprint, arXiv:2209.00840. John Hewitt and Percy Liang

  7. [7]

    Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages 2733–2743, Hong Kong, China. Association for Computational Linguistics. John Hewitt and Christopher D. Manning

  8. [8]

    A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics. Jie Huang and Kevin Chen-Chuan Chang

  9. [9]

    arXiv preprint arXiv:2310.06824

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov

  10. [10]

    https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

    In-context learning and induction heads. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, P...

  11. [11]

    Preprint, arXiv:2303.08774

    Gpt-4 technical report. Preprint, arXiv:2303.08774. Subramanyam Sahoo, Aman Chadha, Vinija Jain, and Divya Chaudhary

  12. [12]

    Preprint, arXiv:2603.09200

    The reasoning trap – logical reasoning as a mechanistic pathway to situational awareness. Preprint, arXiv:2603.09200. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, Davi...

  13. [13]

    Preprint, arXiv:2201.11903

    Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others

  14. [14]

    Estimated via TwoNN (Facco et al., 2017)

    Intrinsic dimensionality. Estimated via TwoNN (Facco et al., 2017). For each point, we compute $\mu = r_2/r_1$ (ratio of second to first nearest-neighbor distance). The estimator is:

    $$\hat{d}_{\mathrm{ID}} = \left( \frac{1}{n} \sum_{i=1}^{n} \log \mu_i \right)^{-1} \qquad (2)$$

    Neighborhood size $k = \max(3, \min(\lfloor \sqrt{N_{\mathrm{correct}}} \rfloor, |H_m|/3))$. Local curvature. For each point $h_i$, we compute SVD of its $k$-nearest-neighbor patch. Curvatu...
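The intrinsic-dimensionality estimator quoted in entry 14, the inverse of the mean log ratio of second to first nearest-neighbor distances, can be sketched in a few lines. This is a brute-force illustration using only the standard library; the function name and the synthetic test data are assumptions, and the paper's own implementation (including how it handles duplicates or subsampling) is not shown in this excerpt.

```python
import math
import random

def twonn_dimension(points):
    """TwoNN estimate: n / sum_i log(r2_i / r1_i), i.e. the inverse mean log ratio."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

# Synthetic check: a 2-D uniform sheet embedded in 5-D ambient space.
random.seed(0)
points = [[random.random(), random.random(), 0.0, 0.0, 0.0] for _ in range(400)]
estimate = twonn_dimension(points)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

On this synthetic sheet the estimate should land near 2 rather than the ambient dimension of 5, which is the property that makes the estimator useful for comparing manifolds in hidden-state space.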