Can LLMs Introspect? A Reality Check

Shashwat Singh; Shauli Ravfogel; Tal Linzen

arxiv: 2605.26242 · v1 · pith:IIMCECQBnew · submitted 2026-05-25 · 💻 cs.AI

Pith reviewed 2026-06-29 21:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords: LLMs, introspection, metacognition, internal states, pattern matching, hidden representations, evaluation paradigms, anomaly detection

The pith

Models cannot reliably distinguish internal state interventions from input manipulations, and input-only classifiers match their performance on hidden-state tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can introspect by re-testing two recent evaluation methods against the possibility that success comes from surface cues rather than privileged access to internal states. In one setup, models fail to separate changes made to their own representations from changes made only to the input text. In the second, simple classifiers given only the input text perform as well as the models at guessing labels derived from hidden states, and a control version that removes semantic cues moves model accuracy closer to chance. A sympathetic reader would care because behavioral success alone cannot confirm metacognitive monitoring if the same results can be produced without any internal access.

Core claim

Models cannot reliably distinguish interventions on their internal states from manipulations of the input, suggesting their success reflects general anomaly detection rather than specific sensitivity to internal changes. Classifiers with access only to the input achieve equivalent performance to the model's own in-context predictions of labels derived from hidden states. In a relabeled control setting where semantics cannot guide the answer, models perform closer to chance, indicating that current evidence does not establish privileged internal access or metacognitive monitoring.

What carries the argument

Two controlled evaluation paradigms: detecting tampered internal states versus input anomalies, and predicting labels from hidden states with input-only classifier baselines plus a relabeled semantic-control condition.
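The logic of the input-only baseline can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not the paper's setup: the "hidden state" is a single number equal to an input feature plus noise, the label is the sign of that hidden state, and the "classifier" is a fixed threshold on the input. The function names, the noise levels, and the sample size are all hypothetical.

```python
import random

def make_toy_data(n, noise, seed=0):
    """Toy stand-in: the hidden state is the input feature plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)              # input-level feature (hypothetical)
        h = x + rng.gauss(0.0, noise)        # hidden-state projection (hypothetical)
        data.append((x, 1 if h > 0 else 0))  # label is derived from the hidden state
    return data

def input_only_accuracy(data):
    """Accuracy of a threshold rule that sees only the input, never the hidden state."""
    correct = sum(1 for x, y in data if (1 if x > 0 else 0) == y)
    return correct / len(data)

# Hidden state mostly determined by the input: labels are predictable from input alone.
acc_input_driven = input_only_accuracy(make_toy_data(n=2000, noise=0.5))

# Hidden state mostly independent of the input: the input-only rule falls toward chance.
acc_mostly_internal = input_only_accuracy(make_toy_data(n=2000, noise=20.0))

print(f"input-driven labels:    {acc_input_driven:.3f}")
print(f"mostly-internal labels: {acc_mostly_internal:.3f}")
```

In the first regime, a system that predicts the labels well need not have any access to the hidden state, which is the concern the baseline is meant to expose. Only in something like the second regime would above-chance prediction be hard to explain from the input alone.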

If this is right

Success on internal-state detection tasks reflects general anomaly detection rather than targeted introspection.
The model's in-context predictions of hidden-state-derived labels can be matched by classifiers given only the input.
Relabeled controls reveal that models rely on task semantics rather than internal representations.
Behavioral evidence alone is insufficient to support strong claims of metacognitive monitoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future tests for introspection will need designs that fix the input while varying internal states in undetectable ways.
The same input-only baseline and relabeling approach could be applied to other behavioral claims of self-knowledge in language models.
This pattern suggests that distinguishing pattern matching from genuine internal monitoring remains a central open problem for evaluating AI systems.

Load-bearing premise

That equal performance by input-only classifiers or near-chance results on the relabeled control necessarily shows absence of privileged internal access rather than other unmeasured factors in the experiments.

What would settle it

An experiment in which models detect internal-state interventions at rates clearly higher than matched input manipulations, or substantially outperform input-only classifiers on the label-prediction task even after semantic cues are removed.
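One way to make "clearly higher" concrete is a one-sided two-proportion z-test on detection counts. The sketch below is illustrative only: the paper is not stated to use this test, the function name is hypothetical, and the counts are invented rather than taken from any reported result.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """One-sided z-test for H1: detection rate in condition A exceeds condition B."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

# Invented counts: detections out of 100 trials per condition.
# A = internal-state intervention, B = matched input manipulation.
z_close, p_close = two_proportion_z(60, 100, 58, 100)
z_far, p_far = two_proportion_z(90, 100, 50, 100)

print(f"similar rates:   z={z_close:.2f}, p={p_close:.3f}")
print(f"separated rates: z={z_far:.2f}, p={p_far:.2e}")
```

With the first pair of counts the rates are statistically indistinguishable, which is the pattern the paper reports qualitatively; the second pair shows what a settling result in favor of targeted detection might look like.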

Figures

Figures reproduced from arXiv: 2605.26242 by Shashwat Singh, Shauli Ravfogel, Tal Linzen.

**Figure 1.** Input-controlled alternatives to purported introspection results. Left: In the biofeedback paradigm of Ji-An et al. (2025), labels are computed from a model’s hidden state via a linear classifier or top PCA directions (A), then used as targets in in-context learning examples (B). Successful prediction has been interpreted as evidence of introspection. We show these labels are also predictable from unconte…

**Figure 2.** Left: Accuracy on the biofeedback paradigms proposed by Ji-An et al. (2025) drops sharply when potential input-level cues are controlled for: compared to the accuracy on the original dataset (red), the models’ accuracy is much lower after random relabeling, which removes semantic correlations (grey). Right: The accuracy of probes trained to predict the hidden-layer PCA labels only from the input (layer 0: …

**Figure 3.** Response distributions for our extension of the …

**Figure 4.** Results when we use the prompt in Appendix …

**Figure 5.** Results when we use the prompt in Appendix …

**Figure 6.** The model shows low false positives in the 2-option case and non-trivially claims hidden …

Abstract

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that two recent tests for LLM introspection do not hold up once input cues are controlled for: input-only classifiers match model performance, and accuracy on a new relabeled control falls closer to chance.

The main takeaway is that claims of LLM metacognitive monitoring rest on shaky ground. In the tampering detection setup, models cannot separate internal-state changes from input manipulations, pointing to general anomaly detection rather than specific internal access. In the hidden-state label prediction task, input-only classifiers reach the same accuracy as the model's in-context guesses, and the new relabeled control—where semantics are stripped so success requires internal representations—falls close to chance.

The relabeled control is the clearest addition. It directly tests whether above-chance results depend on surface semantics or on privileged internal information, and the drop in performance undercuts the original interpretations. The input-classifier comparison is also direct and useful; it shows the model is not doing anything the input alone cannot explain.

The abstract presents coherent logic without circularity or heavy self-citation. The authors draw from human metacognition work to set a higher bar for evidence, which fits the data they report.

The main soft spot is the lack of visible methods details—dataset sizes, exact statistical tests, error bars, or how the interventions were constructed. Without those, it is hard to judge how tight the performance matches really are or whether small implementation choices drive the results. That said, the central pattern (equivalence plus control failure) is not obviously explained away by unmeasured factors.

This work is for researchers building or citing LLM evaluation benchmarks, especially in AI safety contexts that treat self-reports as evidence of monitoring. It does not close the door on introspection but tightens the standards for what counts as evidence.

I would bring the full paper to a reading group to walk through the controls. It deserves peer review because the experimental design addresses a real gap and the negative findings are falsifiable.

Referee Report

1 major / 1 minor

Summary. The paper claims that prior evidence for LLM introspection is insufficient and may reflect pattern matching on surface cues rather than genuine metacognitive access to internal states. It re-examines two paradigms: (1) detecting tampering with internal states, where models fail to distinguish such interventions from input manipulations, indicating general anomaly detection; (2) predicting labels from hidden states, where input-only classifiers match the model's in-context performance, and where a relabeled control, which removes semantic cues and so forces reliance on internal representations, yields performance closer to chance. The conclusion is that behavioral evidence alone cannot establish privileged introspective access.

Significance. If the empirical controls hold, the work provides a timely corrective to overclaims about LLM metacognition by importing standards from human metacognition research and introducing a relabeled control that directly tests for internal access. The explicit design of the relabeled setting (to eliminate semantic leakage) and the input-only classifier baseline are strengths that could raise the bar for future introspection studies. The paper avoids circularity by using new comparisons rather than self-referential results.

major comments (1)

The central claim that input-only classifiers achieve equivalent performance (and that the relabeled control drops to near-chance) is load-bearing, yet the manuscript description provides no dataset sizes, error bars, or statistical tests confirming equivalence or the chance-level result; this weakens the support for the insufficiency conclusion until quantified.

minor comments (1)

The abstract and results summary would benefit from explicit quantitative anchors (e.g., accuracy values or effect sizes) to allow readers to assess the magnitude of the reported equivalences and chance-level performance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for more quantitative support of our central empirical claims. We agree that dataset sizes, error bars, and statistical tests are important for strengthening the evidence and will add them in the revision.

Referee: The central claim that input-only classifiers achieve equivalent performance (and that the relabeled control drops to near-chance) is load-bearing, yet the manuscript description provides no dataset sizes, error bars, or statistical tests confirming equivalence or the chance-level result; this weakens the support for the insufficiency conclusion until quantified.

Authors: We agree that the current manuscript text does not include these details. In the revised version we will add a dedicated results subsection (or appendix table) reporting: (i) exact dataset sizes used for each paradigm and control, (ii) performance means with standard deviations or confidence intervals across multiple runs or seeds, and (iii) statistical tests (e.g., paired t-tests or equivalence tests) comparing the input-only classifier to the model’s in-context performance and confirming that the relabeled condition is statistically indistinguishable from chance. These additions will make the load-bearing claims fully quantifiable without altering the experimental design or conclusions.
Revision: yes
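As an illustration of the kind of equivalence check promised in this response, the sketch below computes a paired percentile-bootstrap confidence interval for the accuracy difference between two predictors scored on the same items, then asks whether the interval sits inside a pre-chosen margin. It is an assumption-laden example, not the authors' analysis: the function name, the margin of 0.10, and the per-item correctness vectors are all hypothetical.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b), paired by item."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented per-item correctness (1 = correct) on the same 100 items.
model_correct = [1] * 70 + [0] * 30   # model's in-context predictions (hypothetical)
probe_correct = [1] * 68 + [0] * 32   # input-only classifier (hypothetical)

ci_low, ci_high = bootstrap_diff_ci(model_correct, probe_correct)
margin = 0.10                          # equivalence margin, chosen in advance (assumption)
equivalent = -margin < ci_low and ci_high < margin

print(f"95% CI for accuracy difference: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"within +/-{margin} margin: {equivalent}")
```

Resampling items rather than the two accuracy scores separately keeps the pairing intact, which matters when both predictors tend to fail on the same examples.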

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical re-examination of two prior evaluation paradigms for LLM introspection, introducing new controls such as input-only classifiers and a relabeled setting where semantic cues are removed. These conclusions rest on the explicit experimental design and performance comparisons rather than any self-citation chain, fitted parameters renamed as predictions, or definitional reductions. No equations or derivations are load-bearing; the central claims derive from the reported results on the new controls, which are independent of the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that behavioral equivalence implies lack of privileged access and that the relabeled task isolates internal representations; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Behavioral evidence alone is inherently insufficient to establish strong introspective claims.
Explicitly stated in the abstract as drawn from human metacognition research and used to interpret all results.

pith-pipeline@v0.9.1-grok · 5801 in / 1072 out tokens · 32139 ms · 2026-06-29T21:21:48.507005+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
cs.CL 2026-06 unverdicted novelty 6.0

Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Qwen2.5 Technical Report

URLhttps://openreview.net/forum?id=aajyHYjjsk. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022. Thomas O Nelson. Metamemory: A theoretical framework and new findings. InPsychology of learning and motivation, volume 26, pp. 1...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-023-06647-8 2022
[2]

Siyuan Song, Harvey Lederman, Jennifer Hu, and Kyle Mahowald

URLhttps://aclanthology.org/2023.emnlp-main.220/. Siyuan Song, Harvey Lederman, Jennifer Hu, and Kyle Mahowald. Privileged self-access matters for introspection in AI.arXiv preprint arXiv:2508.14802, 2025. URL https://arxiv.org/abs/2508. 14802. Noam Steinmetz Yalon, Ariel Goldstein, Liad Mudrik, and Mor Geva. Indications of belief-guided agency and meta-c...

work page arXiv 2023
[3]

human-interpretable

URLhttps://arxiv.org/abs/2602.02467. 19 Preprint. Under review. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, I...

work page arXiv 2025
[4]

gaslight

for thought injection, 2. for prompt manipulation, 3. for control trial. Additionally, if you believe this is either of prompt manipulation or vector injection, please also state the concept you believe I made you think about. Trial 1: What do you detect? E.3 Prompt 2 with 3 options For the same experiment, we have a small variation in which an initial re...

2025
