
arxiv: 2606.02628 · v1 · pith:URSHQLWFnew · submitted 2026-05-30 · 💻 cs.LG · cs.CL

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Pith reviewed 2026-06-28 19:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords hallucination detection · linear probes · hidden states · LLMs · truthfulness signal · mid-layer · quantized models · AUROC evaluation

The pith

A linear probe on a single mid-network layer decodes hallucination from hidden states in quantized LLMs at 0.904-1.000 AUROC.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether instruction-tuned LLMs encode a truthfulness signal inside their activations that separates hallucinated from truthful outputs. It extracts hidden states layer by layer from three 7-8B models under 4-bit quantization and trains simple detectors on four benchmarks. A linear classifier at mid-depth reaches 0.904-1.000 AUROC on held-out data, while methods that rely on sampling multiple answers stay near chance (at most 0.541 AUROC). The result suggests that the relevant information is already present in an approximately linear form at predictable depths, so detection does not have to wait for post-generation checks.

Core claim

Across Llama-3.1-8B, Mistral-7B and Qwen2.5-7B in NF4 quantization, a linear probe on hidden states from a single mid-network layer achieves 0.904-1.000 AUROC for hallucination detection on held-out splits of TruthfulQA, HaluEval-QA, FEVER and a synthetic set. On the natural-language benchmarks the peak layers fall in blocks 13-18 of 32 for Llama and Mistral and blocks 19-25 of 28 for Qwen. Sampling-based detectors do not exceed 0.541 AUROC under the same protocol, MLP probes rarely beat linear probes by more than 0.01 AUROC, and first-block attention entropy supplies a complementary signal on knowledge-grounded tasks (0.866-0.941 AUROC on HaluEval-QA) at no additional inference cost.

What carries the argument

Linear probe applied to per-layer hidden states, which extracts an approximately linear truthfulness signal whose strength peaks in a consistent mid-network band.
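
A minimal sketch of the layer-wise linear probing idea, run on synthetic data rather than real model activations. Everything here is an assumption made for illustration: the array `hidden` stands in for per-layer hidden states, the injected `strength` profile is chosen so that the signal peaks mid-network, and the probe is a default scikit-learn logistic regression. It is not the paper's released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, n_layers, dim = 400, 8, 32

# Synthetic labels: 1 = hallucinated, 0 = truthful (illustrative only).
labels = rng.integers(0, 2, size=n_examples)
signed = 2 * labels - 1  # -1 for truthful, +1 for hallucinated

# Stand-in for per-layer hidden states. A class signal is injected along one
# random direction, strongest at the middle layers by construction.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
strength = np.array([0.0, 0.2, 0.5, 2.0, 2.0, 0.5, 0.2, 0.0])
hidden = rng.normal(size=(n_layers, n_examples, dim))
hidden += strength[:, None, None] * signed[None, :, None] * direction[None, None, :]

def probe_auroc(features, labels, seed=0):
    """Train a linear probe on 80% of the data and score AUROC on the rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, probe.decision_function(X_test))

aurocs = [probe_auroc(hidden[layer], labels) for layer in range(n_layers)]
best_layer = int(np.argmax(aurocs))
print([round(a, 3) for a in aurocs], "best layer:", best_layer)
```

With real models, `hidden[layer]` would be replaced by the activations captured at each transformer block; which token position to read is a design choice this sketch does not address.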

If this is right

  • Detection requires only one forward pass plus a fixed linear classifier at a known layer.
  • The location of the strongest signal remains stable across model families on natural-language benchmarks.
  • MLP probes rarely beat linear probes by more than 0.01 AUROC, which suggests the signal is approximately linear.
  • Attention entropy at the first block supplies an orthogonal signal at no added cost on knowledge tasks.
  • Sampling methods appear weak here because paired-label evaluation does not match the information they use.
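
The attention-entropy signal mentioned in the list above can be sketched as the Shannon entropy of the last token's attention distribution. This is a hedged illustration: the function name `last_token_attention_entropy`, the averaging over heads, and the toy attention tensors are all assumptions, since the page does not say how the paper aggregates across heads.

```python
import numpy as np

def last_token_attention_entropy(attn):
    """Mean-over-heads entropy (in nats) of the last query row.

    attn: array of shape (n_heads, seq_len, seq_len) whose rows sum to 1.
    Averaging over heads is an assumption made for this sketch.
    """
    last_row = attn[:, -1, :]                     # (n_heads, seq_len)
    eps = 1e-12                                   # avoids log(0)
    per_head = -(last_row * np.log(last_row + eps)).sum(axis=-1)
    return float(per_head.mean())

# Toy inputs: uniform attention over 4 tokens vs. attention fixed on token 0.
uniform = np.full((2, 4, 4), 0.25)
peaked = np.zeros((2, 4, 4))
peaked[:, :, 0] = 1.0

uniform_entropy = last_token_attention_entropy(uniform)
peaked_entropy = last_token_attention_entropy(peaked)
print(uniform_entropy, peaked_entropy)
```

Because the attention weights are already produced during the forward pass, a score of this kind needs no extra generation, which is consistent with the "no added cost" description.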

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time monitoring of the identified mid-layer could allow generation to be halted or corrected before output.
  • Because the signal is linear, targeted activation editing at those layers might increase overall truthfulness.
  • The same extraction method could be tested on other model behaviors such as confidence or refusal.
  • If the probe generalizes across domains, it would reduce reliance on expensive sampling or external verifiers.

Load-bearing premise

The benchmark labels correctly mark whether each model response matches an internal truthfulness representation rather than reflecting prompt difficulty or annotation artifacts.

What would settle it

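Retrain the probe on the same activations with the hallucination labels either randomly shuffled or replaced by an independent external fact-check set. The shuffled-label run is a control and should fall to AUROC near 0.5. If the run with independent labels also falls to near 0.5, the probe is tracking the original annotations rather than truthfulness.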
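
The shuffled-label part of this check can be sketched on synthetic data. The features, the injected class offset, and the helper `held_out_auroc` are hypothetical stand-ins; only the idea of comparing real against permuted labels comes from the text above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, dim = 1000, 16

# Synthetic stand-in for activations with a genuine class signal on axis 0.
labels = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, dim))
features[:, 0] += 2.0 * (2 * labels - 1)

def held_out_auroc(X, y, seed=0):
    """Fit a linear probe on 80% of the data and score AUROC on the rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, probe.decision_function(X_test))

real_auroc = held_out_auroc(features, labels)
# Permuting the labels breaks any link between activations and labels.
shuffled_auroc = held_out_auroc(features, rng.permutation(labels))
print(round(real_auroc, 3), round(shuffled_auroc, 3))
```

A shuffled-label AUROC well above 0.5 would point to leakage in the pipeline rather than to a real signal.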

Figures

Figures reproduced from arXiv: 2606.02628 by Aizierjiang Aiersilan.

Figure 1: Method comparison on HaluEval-QA. Probes reach ≥ 0.997 AUROC on all three models; attention entropy at the first block yields 0.866–0.941 AUROC at no extra inference cost; INSIDE and self-consistency are at chance.
Figure 2: Left: Layer-wise MLP-probe AUROC for Llama-3.1-8B on TruthfulQA; the signal peaks near block 14 (shaded band: std over three seeds). Right: Per-block MLP-probe AUROC heatmap across all three models on TruthfulQA; the peak band is consistently in the second half of the network.
4.2 Layer-wise probing trajectories. Probing performance varies smoothly with depth (…
Figure 3: Left: Layer-wise class geometry for Llama-3.1-8B on TruthfulQA showing centroid distance, within-class spread, and their ratio; the separation ratio plateaus in the second half of the network, mirroring probe AUROC (…
Figure 4: Left: Last-token attention entropy on Llama-3.1-8B / HaluEval-QA at three transformer blocks; the first block produces the best class separation (AUROC 0.941). Right: INSIDE EigenScore distribution on Llama-3.1-8B / TruthfulQA; the truthful and hallucinated histograms overlap almost completely (AUROC 0.433).
4.5 Why sampling-based methods underperform. As shown in …
Figure 2: Appendix D shows class-separation diagnostics for Mistral and Qwen on TruthfulQA (…
Figure 3: Appendix H provides the Qwen t-SNE projection at its probing peak (…
Figure 5: Per-dataset method-comparison bar plots for TruthfulQA (top), FEVER (middle), and the synthetic benchmark (bottom). Probes are uniformly at the top, attention entropy is competitive on knowledge-grounded settings, and sampling-based detectors cluster at chance. Cf. …
Figure 6: MLP-probe per-block AUROC heatmaps on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom). Rows within each heatmap correspond to the three models; columns are block indices; color encodes AUROC. The bright peak band is consistently in the second half of the network. Cf. …
Figure 7: Reproduces the per-block MLP-probe AUROC curves of Mistral-7B and Qwen2.5-7B on TruthfulQA, complementing …
Figure 8: Per-block MLP-probe AUROC for Llama-3.1-8B on HaluEval-QA. The plateau saturates at AUROC ≈0.998 across blocks ∼12–24, with seed standard deviation below 0.005.
[Plot residue: panel "Centroid distance vs within-class spread" (x-axis: layer index; y-axis: distance (L2); series: ‖μ1 − μ0‖, spread (truthful), spread (hallucinated)) and a panel of separation ratio against layer index.]
Figure 9: Layer-wise class geometry on TruthfulQA for Mistral-7B (top) and Qwen2.5-7B (bottom), showing centroid distance, within-class spread, and separation ratio. The separation ratio rises through the early blocks and saturates in the mid-to-late blocks, mirroring the probe-AUROC trajectories (…
Figure 10: Reproduces the per-class INSIDE EigenScore distribution on TruthfulQA for Mistral-7B and Qwen2.5-7B, complementing …
Figure 11: Last-token attention entropy on HaluEval-QA for Mistral-7B (top) and Qwen2.5-7B (bottom) at the first, middle, and last captured transformer blocks. The first-block separation (AUROC 0.902 and 0.866 respectively) confirms the pattern observed for Llama-3.1-8B in …
Figure 12: 2-D PCA projections on TruthfulQA at each model's probing peak. Left: Llama-3.1-8B at block 14. Right: Qwen2.5-7B at block 19. Both models show clear class separation along the leading principal components, consistent with the high linear-probe AUROC. Cf. …
Figure 13: t-SNE projection of the layer-19 hidden states of Qwen2.5-7B on TruthfulQA (Qwen probing peak, perplexity 30). The two classes form distinct clusters with a thin margin region, confirming the geometric separation that underlies the 0.925 MLP-probe AUROC.
Appendix I, Synthetic Benchmark Construction. The synthetic benchmark is generated locally from four small knowledge banks: world capitals (30 countries), chemical …
Figure 6: The linear and MLP heatmaps are visually almost indistinguishable.
Figure 14: Per-block AUROC heatmaps for the linear SAPLMA probe across all four datasets. Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Rows within each heatmap are the three models; columns are block indices; color is AUROC. The pattern is qualitatively identical to the MLP counterparts (Figures 2, 6).
Appendix K, Layer-Wise AUROC Trajectories Across All Cells. Figures 15–17 report…
Figure 15: Per-block MLP-probe AUROC for Llama-3.1-8B. Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Shaded band: standard deviation over three seeds.
[Plot residue: panels titled "mistral-7b / truthfulqa / mlp" and "mistral-7b / ha…"; x-axis: layer index (0 = embeddings); y-axis: AUROC, with a chance line.]
Figure 16: Per-block MLP-probe AUROC for Mistral-7B-Instruct-v0.3. Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Shaded band: standard deviation over three seeds.
Figure 17: Per-block MLP-probe AUROC for Qwen2.5-7B-Instruct (28 transformer blocks). Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Shaded band: standard deviation over three seeds.
Figure 18: Layer-wise class geometry for Llama-3.1-8B on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom), showing centroid distance, within-class spread, and separation ratio. The separation ratio plateaus in mid-to-late blocks across all three datasets. Cf. …
Figure 19: Layer-wise class geometry for Mistral-7B on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom). The same rise-and-plateau pattern holds across datasets. Cf. …
Figure 20: Layer-wise class geometry for Qwen2.5-7B on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom). The synthetic benchmark exhibits the steepest centroid drift, consistent with the ≈1.000 AUROC plateau. Cf. …
Figure 21: INSIDE EigenScore distributions for Llama-3.1-8B on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom), separated by ground-truth label. The two class distributions are nearly indistinguishable in every panel. Cf. …
Figure 22: INSIDE EigenScore distributions for Mistral-7B on HaluEval-QA (top), FEVER (middle), and the synthetic benchmark (bottom). The near-complete overlap persists across all datasets. Cf. …
Figure 23: INSIDE EigenScore distributions for Qwen2.5-7B on HaluEval-QA (left), FEVER (middle), and the synthetic benchmark (right). The uniform overlap across all 12 configurations confirms that the chance-level INSIDE AUROC (0.433–0.529) is inherent to the evaluation protocol. Cf. …
Figure 24: Last-token attention entropy on TruthfulQA for Llama-3.1-8B (top), Mistral-7B (middle), and Qwen2.5-7B (bottom) at three captured transformer blocks. The truthful and hallucinated histograms overlap substantially in every panel, with the best per-model AUROC remaining below 0.61.
Figure 25: Last-token attention entropy on FEVER for Llama-3.1-8B (top), Mistral-7B (middle), and Qwen2.5-7B (bottom). Without an explicit supporting-knowledge channel, the per-class entropy histograms are no longer cleanly separable, contrasting with the strong first-block signal on HaluEval-QA (…
Figure 26: Last-token attention entropy on the synthetic benchmark for Llama-3.1-8B (top), Mistral-7B (middle), and Qwen2.5-7B (bottom). The middle block performs best for Mistral (AUROC 0.774), but the signal remains far weaker than probe-based detectors across all three models.
Figure 27: PCA (left) and t-SNE (right) projections of Llama-3.1-8B hidden states at the block maximizing the classifier-free separation ratio: HaluEval-QA (top, block 32), FEVER (middle, block 0), and synthetic (bottom, block 12). Panels at block 0 show heavily overlapping clouds because the separation ratio peaks early due to lexical-identity effects rather than truthfulness geometry. Cf. Figures 3, 12 for the Tru…
Figure 28: PCA (left) and t-SNE (right) projections of Mistral-7B hidden states at the separation-ratio argmax block. Top to bottom: TruthfulQA (block 0), HaluEval-QA (block 0), FEVER (block 0), and synthetic (block 14). The three block-0 panels show overlapping clusters; the synthetic block-14 panel shows clearer separation, consistent with its high probe AUROC.
Figure 29: PCA (left) and t-SNE (right) projections of Qwen2.5-7B hidden states at the separation-ratio argmax block. Top to bottom: HaluEval-QA (block 0), FEVER (block 0), and synthetic (block 18). The block-18 synthetic panel shows clear class clusters; the block-0 panels reflect lexical rather than truthfulness separation. Cf. Figures 12, 13 for the TruthfulQA probing-peak projection.
Figure 30
Figure 31: Per-block linear-probe AUROC for Mistral-7B-Instruct-v0.3. Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Shaded band: standard deviation over three seeds. Cf. …
Figure 32: Per-block linear-probe AUROC for Qwen2.5-7B-Instruct (28 transformer blocks). Top-left: TruthfulQA. Top-right: HaluEval-QA. Bottom-left: FEVER. Bottom-right: synthetic. Shaded band: standard deviation over three seeds. Cf. …
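
Several captions above refer to centroid distance, within-class spread, and a separation ratio. A minimal sketch of one way to compute these follows. The exact definitions are assumptions: spread is taken as the mean L2 distance of each class's points to its own centroid, and the ratio divides the centroid distance by the average of the two spreads. The paper may define them differently.

```python
import numpy as np

def class_geometry(hidden, labels):
    """Classifier-free separation statistics for one layer.

    hidden: (n_examples, dim) activations; labels: 0 = truthful, 1 = hallucinated.
    Returns (centroid_distance, spread_truthful, spread_hallucinated, ratio).
    """
    h0, h1 = hidden[labels == 0], hidden[labels == 1]
    mu0, mu1 = h0.mean(axis=0), h1.mean(axis=0)
    centroid_distance = np.linalg.norm(mu1 - mu0)
    spread0 = np.linalg.norm(h0 - mu0, axis=1).mean()
    spread1 = np.linalg.norm(h1 - mu1, axis=1).mean()
    ratio = centroid_distance / (0.5 * (spread0 + spread1))
    return centroid_distance, spread0, spread1, ratio

# Toy example: two classes 4 units apart, each point 1 unit from its centroid.
toy_hidden = np.array([[0.0, 1.0], [0.0, -1.0], [4.0, 1.0], [4.0, -1.0]])
toy_labels = np.array([0, 0, 1, 1])
distance, spread_t, spread_h, ratio = class_geometry(toy_hidden, toy_labels)
print(distance, spread_t, spread_h, ratio)
```

Unlike a probe, this statistic needs no training, which is why it can peak at a different block than probe AUROC, as some of the captions note.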

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three $7$B--$8$B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in $4$-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves $0.904$--$1.000$ AUROC on held-out splits, while sampling-based detectors do not exceed $0.541$ AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than $0.01$ AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks -- blocks~$13$--$18$ of~$32$ for Llama and Mistral, and blocks~$19$--$25$ of~$28$ for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings ($0.866$--$0.941$ AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single $8$\,GB GPU.
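
For readers less familiar with the metric, the AUROC values quoted in the abstract can be read through the standard pairwise interpretation. The notation is an assumption introduced here: $s$ is a detector score, $x^{+}$ a randomly drawn example of the positive class, and $x^{-}$ a randomly drawn example of the negative class. Which class the paper treats as positive is not stated on this page.

```latex
\mathrm{AUROC}
  = \Pr\!\big[s(x^{+}) > s(x^{-})\big]
  + \tfrac{1}{2}\,\Pr\!\big[s(x^{+}) = s(x^{-})\big]
```

Under this reading, 0.5 corresponds to chance ranking and 1.0 to perfect ranking, so the reported 0.541 ceiling for sampling-based detectors sits only slightly above chance.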

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that in 4-bit quantized 7-8B instruction-tuned LLMs (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B), a linear probe trained on hidden states from a single mid-network layer achieves 0.904-1.000 AUROC for hallucination detection on held-out splits of TruthfulQA, HaluEval-QA, FEVER, and a synthetic benchmark, substantially outperforming sampling-based detectors (max 0.541 AUROC). It further claims that the truthfulness signal is approximately linear (MLP probes rarely beat linear probes by more than 0.01 AUROC), that peak layers are consistent across model families on natural-language benchmarks (blocks 13-18/32 for Llama/Mistral; 19-25/28 for Qwen), and that first-block attention entropy provides a complementary low-cost signal on knowledge-grounded tasks.

Significance. If the central empirical result holds after addressing label validity, the work would demonstrate that hallucination-related information is linearly decodable from mid-layer activations in quantized models, offering a computationally cheap alternative to sampling-based detection. The release of code and data for single-GPU reproducibility is a clear strength that enables direct verification and extension.

major comments (2)
  1. [Evaluation on natural-language benchmarks (TruthfulQA, HaluEval-QA, FEVER)] The interpretation that high AUROC demonstrates a 'linearly decodable truthfulness signal' inside the model rests on the untested assumption that external benchmark labels (human judgments on TruthfulQA, fact-checker labels on FEVER, etc.) align with whatever the quantized model has internally represented. No analysis is provided showing that the probe is not instead capturing surface correlates such as answer length, lexical overlap, or prompt difficulty. This is load-bearing for the headline claim because the natural-language benchmarks drive the cross-model consistency result; the synthetic set alone is insufficient to support it.
  2. [Probe training and evaluation protocol] The abstract and reported results supply no information on the linear probe training procedure, exact train/test splits, regularization, optimization hyperparameters, or any control for post-hoc layer selection. Without these details the reported 0.904-1.000 AUROC values cannot be independently verified or reproduced from the released code alone.
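
One control of the kind major comment 1 asks for is to regress surface covariates out of the activations before probing. The sketch below is a hedged illustration on synthetic data: `answer_length` is a hypothetical confound, and the features are built so that they carry no label information beyond that confound. It is not taken from the paper or its repository.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, dim = 1000, 16
labels = rng.integers(0, 2, size=n)

# Hypothetical confound: answer length correlates with the label.
answer_length = rng.normal(size=n) + 1.5 * labels
confounds = answer_length[:, None]

# Synthetic activations that encode only the confound, not truthfulness.
features = rng.normal(size=(n, dim))
features[:, 0] += 2.0 * answer_length

X_tr, X_te, c_tr, c_te, y_tr, y_te = train_test_split(
    features, confounds, labels, test_size=0.2, stratify=labels, random_state=0)

def held_out_auroc(X_train, y_train, X_test, y_test):
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, probe.decision_function(X_test))

raw_auroc = held_out_auroc(X_tr, y_tr, X_te, y_te)

# Fit the confound regression on the training split only, then probe residuals.
regression = LinearRegression().fit(c_tr, X_tr)
residual_auroc = held_out_auroc(
    X_tr - regression.predict(c_tr), y_tr,
    X_te - regression.predict(c_te), y_te)
print(round(raw_auroc, 3), round(residual_auroc, 3))
```

In this toy case the probe's apparent skill disappears once the confound is removed. On real activations a small drop would favor the internal-signal reading, though residualizing can also strip genuine signal that happens to correlate with the covariates.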
minor comments (1)
  1. [Discussion of sampling-based detectors] The claim that 'the low discriminability of sampling methods ... reflects a structural mismatch' is asserted but would benefit from a short explicit comparison of what information each method has access to under the paired-label protocol.
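
One possible reading of the "structural mismatch" is sketched below. It assumes, without confirmation from the source, that under paired-label evaluation each prompt contributes one truthful and one hallucinated item, and that a sampling-based detector produces a score that depends only on the prompt. The variable `prompt_score` is a hypothetical stand-in for such a score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_prompts = 200

# Hypothetical prompt-level score, e.g. disagreement among sampled answers.
prompt_score = rng.random(n_prompts)

# Paired evaluation: every prompt appears once with a truthful answer (0)
# and once with a hallucinated answer (1), sharing the same prompt-level score.
labels = np.concatenate([np.zeros(n_prompts, dtype=int),
                         np.ones(n_prompts, dtype=int)])
scores = np.concatenate([prompt_score, prompt_score])

paired_auroc = roc_auc_score(labels, scores)
print(paired_auroc)
```

If the score cannot distinguish the two members of a pair, AUROC lands at chance regardless of how informative the score is about the prompt, which would explain weak results without implying the methods are flawed in general.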

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight important aspects of interpretability and reproducibility. We address each below and will revise the manuscript accordingly.

  1. Referee: [Evaluation on natural-language benchmarks (TruthfulQA, HaluEval-QA, FEVER)] The interpretation that high AUROC demonstrates a 'linearly decodable truthfulness signal' inside the model rests on the untested assumption that external benchmark labels (human judgments on TruthfulQA, fact-checker labels on FEVER, etc.) align with whatever the quantized model has internally represented. No analysis is provided showing that the probe is not instead capturing surface correlates such as answer length, lexical overlap, or prompt difficulty. This is load-bearing for the headline claim because the natural-language benchmarks drive the cross-model consistency result; the synthetic set alone is insufficient to support it.

    Authors: We agree that additional evidence is needed to rule out surface correlates as the primary driver of probe performance on the natural-language benchmarks. The synthetic benchmark provides controlled labels, but does not fully address this for the cross-model consistency results. In revision we will add an ablation that regresses out answer length, lexical overlap (Jaccard and TF-IDF), and prompt difficulty proxies (e.g., perplexity of the prompt) before training the probes, and report the resulting AUROC drop. If the drop is small, this will support the internal-signal interpretation; if large, we will qualify the claims accordingly. revision: yes

  2. Referee: [Probe training and evaluation protocol] The abstract and reported results supply no information on the linear probe training procedure, exact train/test splits, regularization, optimization hyperparameters, or any control for post-hoc layer selection. Without these details the reported 0.904-1.000 AUROC values cannot be independently verified or reproduced from the released code alone.

    Authors: The full manuscript contains a methods subsection on probe training (logistic regression via scikit-learn with L2 regularization, 5-fold cross-validation on the training portion, 80/20 stratified splits, and layer selection via validation AUROC), and the released repository includes the exact training scripts. However, we acknowledge that these details are insufficiently prominent. We will expand the methods section with all hyperparameters, split statistics, and the precise layer-selection protocol so that the paper is self-contained and the numbers are directly reproducible from the text alone. revision: yes
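
The protocol described in the second response (logistic regression with L2 regularization, an 80/20 stratified split, 5-fold cross-validation on the training portion, layer selection by validation AUROC) can be sketched as follows. The data is synthetic, and the regularization strength `C=1.0`, the seeds, and the array shapes are assumptions. The point of the sketch is that the layer is chosen without touching the test split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

rng = np.random.default_rng(4)
n, n_layers, dim = 500, 6, 24
labels = rng.integers(0, 2, size=n)

# Synthetic per-layer activations with the class signal strongest mid-network.
strength = np.array([0.0, 0.3, 2.0, 2.0, 0.3, 0.0])
hidden = rng.normal(size=(n_layers, n, dim))
hidden[:, :, 0] += strength[:, None] * (2 * labels - 1)[None, :]

# 80/20 stratified split on example indices, shared across layers.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.2, stratify=labels, random_state=0)

def make_probe():
    # scikit-learn's LogisticRegression applies L2 regularization by default.
    return LogisticRegression(C=1.0, max_iter=1000)

# Select the layer by 5-fold cross-validated AUROC on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auroc = [
    cross_val_score(make_probe(), hidden[layer][idx_train], labels[idx_train],
                    cv=cv, scoring="roc_auc").mean()
    for layer in range(n_layers)
]
selected_layer = int(np.argmax(cv_auroc))

# Refit on the full training portion and report AUROC once on the test split.
probe = make_probe().fit(hidden[selected_layer][idx_train], labels[idx_train])
test_auroc = roc_auc_score(
    labels[idx_test],
    probe.decision_function(hidden[selected_layer][idx_test]))
print(selected_layer, round(test_auroc, 3))
```

Selecting the layer on test AUROC instead would inflate the reported number, which is the post-hoc selection concern the referee raises.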

Circularity Check

0 steps flagged

No circularity: empirical AUROC results on external benchmarks with held-out evaluation


The paper performs standard supervised probing: hidden states are extracted from fixed external benchmarks (TruthfulQA, HaluEval-QA, FEVER, synthetic set), linear classifiers are trained on train splits, and AUROC is measured on held-out test splits. No equations, predictions, or first-principles claims reduce the reported performance to quantities defined by the authors' own fitted parameters or self-citations. The protocol is externally falsifiable via the released code and data; the central claim (linear decodability) is a measured empirical outcome rather than a definitional or self-referential identity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about linear separability and benchmark validity. No new physical entities or theoretical constructs are introduced.

free parameters (1)
  • Selected probe layer
    Peak layers are identified from experimental results on the same benchmarks, introducing a data-dependent choice.
axioms (1)
  • Domain assumption: hidden states at a given layer contain a linearly separable truthfulness signal that aligns with benchmark labels.
    This is the central hypothesis tested by the probing experiments.


