pith. machine review for the scientific record.

arxiv: 2510.09033 · v3 · submitted 2025-10-10 · 💻 cs.CL


Do LLMs Really Know What They Don't Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness

classification 💻 cs.CL
keywords: hallucinations, internal, know, outputs, associations, knowledge, llms, processes
abstract

Recent work suggests that LLMs "know what they don't know", positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model's parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.
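The abstract's central geometric claim can be pictured with a toy sketch (not the paper's code; the 2-D "hidden states", cluster centroids, and nearest-centroid detector below are illustrative assumptions): if Associated Hallucinations (AHs) occupy the same hidden-state region as factual outputs while Unassociated Hallucinations (UHs) form their own cluster, a detector trained on truthfulness labels will flag UHs but miss AHs.

```python
# Toy illustration of the claimed hidden-state geometry (assumed, not from the paper):
# factual outputs and AHs overlap in representation space ("recall"), UHs sit apart
# ("no parametric grounding"), so a truthfulness probe catches UHs but misses AHs.
import random

random.seed(0)

def sample(center, n, spread=0.3):
    """Synthetic 2-D 'hidden states' scattered around a class centroid."""
    return [(center[0] + random.gauss(0, spread),
             center[1] + random.gauss(0, spread)) for _ in range(n)]

factual = sample((1.0, 1.0), 200)
ah      = sample((1.0, 1.0), 200)    # AHs: same region as factual outputs
uh      = sample((-1.0, -1.0), 200)  # UHs: distinct, clustered region

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

# A "standard" detector: nearest-centroid between truthful and hallucinated
# states, trained with truthfulness labels (AHs and UHs both labelled hallucinated).
c_true = centroid(factual)
c_hall = centroid(ah + uh)

def flagged_as_hallucination(p):
    d_true = (p[0] - c_true[0]) ** 2 + (p[1] - c_true[1]) ** 2
    d_hall = (p[0] - c_hall[0]) ** 2 + (p[1] - c_hall[1]) ** 2
    return d_hall < d_true

uh_recall = sum(map(flagged_as_hallucination, uh)) / len(uh)
ah_recall = sum(map(flagged_as_hallucination, ah)) / len(ah)
print(f"UHs flagged: {uh_recall:.0%}, AHs flagged: {ah_recall:.0%}")
```

Under this assumed geometry the detector flags nearly all UHs and almost no AHs, matching the abstract's conclusion that internal-state detection works for UHs but fails for AHs.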


discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs exhibit domain-specific privileged knowledge in hidden states for factual correctness but not math reasoning, visible only on model disagreement subsets.

  2. CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    CoSToM maps ToM features inside LLMs with causal tracing and steers activations in critical layers to boost intrinsic social reasoning and dialogue quality.