pith. machine review for the scientific record.

arxiv: 2605.14449 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Authors on Pith no claims yet

Pith reviewed 2026-05-15 01:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords hallucination detection · orthogonal decomposition · LLM representations · out-of-domain generalization · probing methods · factuality assessment · single-pass detection

The pith

Projecting away the question direction from answer representations isolates reliable factuality signals for hallucination detection in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes QAOD to detect hallucinations by decomposing answer representations orthogonally to the question. This suppresses domain-specific variation, allowing a single-pass method that works well both in-domain and when the domain shifts. The joint probe, which includes question context, excels at in-domain detection with top AUROC scores. The orthogonal-only probe focuses on domain-agnostic signals and shows strong transfer to out-of-domain settings, beating previous white-box methods by significant margins while using far less compute than consistency-based approaches.

Core claim

QAOD projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. Layer and neuron selection via Fisher scoring identify informative signals. The joint probe with question context maximizes in-domain discriminability, while the orthogonal component alone preserves domain-agnostic factuality signals for robust OOD transfer.

What carries the argument

The question-orthogonal component of the answer representation, obtained by projecting away the question-aligned direction.
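The decomposition itself is a standard vector projection. A minimal numpy sketch, assuming mean-pooled hidden states for the answer and question at one layer (variable names are ours, not the paper's):

```python
import numpy as np

def question_orthogonal(h_a: np.ndarray, h_q: np.ndarray) -> np.ndarray:
    """Remove the question-aligned direction from the answer representation.

    h_a, h_q: hidden-state vectors (e.g. mean-pooled over tokens) for the
    answer and question at one transformer layer. Returns the component of
    h_a orthogonal to h_q.
    """
    # Question-aligned part: (h_a . h_q / ||h_q||^2) * h_q
    h_par = (h_a @ h_q) / (h_q @ h_q) * h_q
    # Question-orthogonal part: what remains of the answer state
    return h_a - h_par
```

By construction the returned vector has zero dot product with the question direction; the paper's actual pipeline (token pooling, layer choice, any normalization) may differ from this sketch.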

If this is right

  • The joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs.
  • The orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ.
  • This is achieved at under 25% of the generation cost of consistency methods.
  • Layer selection via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance identify the informative signals.
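The selection step in the last bullet can be sketched generically. The classical per-neuron Fisher score is between-class separation over within-class variance; the paper's diversity penalty and exact thresholds are not reproduced here, so treat this as an illustrative baseline, not the authors' procedure:

```python
import numpy as np

def fisher_scores(feats: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-neuron Fisher score: (mu1 - mu0)^2 / (var1 + var0).

    feats: (n_samples, n_neurons) activations; labels: binary
    hallucination labels. High scores mark neurons whose activations
    separate faithful from hallucinated answers.
    """
    x0, x1 = feats[labels == 0], feats[labels == 1]
    num = (x1.mean(axis=0) - x0.mean(axis=0)) ** 2
    den = x1.var(axis=0) + x0.var(axis=0) + 1e-12  # guard against zero variance
    return num / den

def select_neurons(scores: np.ndarray, cum_threshold: float = 0.9) -> np.ndarray:
    """Keep the top-scoring neurons covering `cum_threshold` of total Fisher mass."""
    order = np.argsort(scores)[::-1]
    cum = np.cumsum(scores[order]) / scores.sum()
    k = int(np.searchsorted(cum, cum_threshold)) + 1
    return order[:k]
```

The cumulative-mass cutoff mirrors the "cumulative Fisher threshold" ablated in Figure 6; the value 0.9 is a placeholder, not a number from the paper.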

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This decomposition suggests that much of the domain shift in hallucination signals aligns with question features in the representation space.
  • The method could be extended to other single-pass detection tasks where separating context from content is useful.
  • Further gains might come from applying similar orthogonal projections in multimodal or multi-turn settings.

Load-bearing premise

Removing the question-aligned direction from the answer representation suppresses domain-conditioned variation and isolates reliable factuality signals usable for both in-domain and OOD detection.

What would settle it

Running the orthogonal-only probe on a held-out domain shift dataset such as BioASQ and observing AUROC no better than the best white-box baseline would falsify the OOD robustness claim.
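Such a falsification test reduces to comparing AUROC values, which can be computed without any ML library via the rank-sum identity (fraction of positive-negative pairs the detector orders correctly, counting ties as half):

```python
import bisect

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U identity.

    scores: detector outputs (higher = more likely hallucinated);
    labels: 1 for hallucinated answers, 0 for faithful ones.
    """
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for s in neg:
        lo = bisect.bisect_left(pos, s)
        hi = bisect.bisect_right(pos, s)
        # positives strictly above this negative, plus half credit for ties
        wins += (len(pos) - hi) + 0.5 * (hi - lo)
    return wins / (len(pos) * len(neg))
```

An orthogonal-only probe whose `auroc` on BioASQ does not exceed the best white-box baseline's would contradict the OOD claim.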

Figures

Figures reproduced from arXiv: 2605.14449 by Erhu Feng, Siyang Yao, Yubin Xia.

Figure 1
Figure 1. QAOD architecture: an offline selection branch identifies discriminative layers and neurons via Fisher-based scores; the online backbone extracts selected features and computes question-orthogonal components in a single forward pass. view at source ↗
Figure 2
Figure 2. Zero-shot OOD AUROC on BioASQ across four LLMs. Bars show OOD AUROC for detectors trained… view at source ↗
Figure 3
Figure 3. Layer-wise normalized centroid shift (‖µ_src − µ_tgt‖/σ̄) from {TriviaQA, SQuAD, NQ} to BioASQ. The question-orthogonal component v⊥ exhibits the smallest displacement across all layers, with domain-dependent variation concentrated in the question-aligned component h∥^(l) = (h_A^(l) · h_Q^(l) / ‖h_Q^(l)‖²) h_Q^(l). The question state h_Q^(l) undergoes the largest displacement, the original answer state h_A lies in the middle, and the questi… view at source ↗
Figure 5
Figure 5. Layer-selection ablation on Qwen3-14B / TriviaQA (AUROC, …). view at source ↗
Figure 6
Figure 6. Neuron-selection ablation on the cumulative Fisher threshold… view at source ↗
Figure 7
Figure 7. Inference-time comparison (log scale), including shared generation cost and additional detection overhead… view at source ↗
read the original abstract

Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces QAOD, a single-pass hallucination detection framework for LLMs that performs orthogonal decomposition on answer hidden states to remove question-aligned directions, thereby suppressing domain-specific variation. It uses diversity-penalized Fisher scoring to select layers and Fisher importance for neurons, and proposes a joint probe (orthogonal component + question context) for in-domain detection and an orthogonal-only probe for OOD generalization. The authors claim that the joint probe achieves the best in-domain AUROC across evaluated models and datasets, while the orthogonal-only probe provides the strongest OOD transfer, outperforming white-box baselines by up to 21% on BioASQ with less than 25% of the generation cost.

Significance. If the empirical results are robust and the core assumption holds, this approach offers a promising balance of efficiency and robustness for hallucination detection, particularly in cross-domain settings where consistency-based methods are computationally expensive. The use of orthogonal projection to isolate factuality signals could advance white-box probing techniques.

major comments (3)
  1. [Method] The central claim that the question-orthogonal component isolates domain-agnostic factuality signals rests on an untested assumption; no analysis demonstrates that the removed question-aligned subspace correlates more strongly with domain labels than with hallucination labels (see the decomposition construction and OOD transfer claims).
  2. [Abstract and Experiments] Performance numbers in the abstract (e.g., best in-domain AUROC and up to 21% OOD gain on BioASQ) are stated without accompanying experimental details such as dataset sizes, number of runs, error bars, or ablation results on layer/neuron selection, undermining verifiability of the reported superiority.
  3. [§3.2] The diversity-penalized Fisher scoring for layer selection introduces free parameters that appear tuned on evaluation data, creating a risk of circularity in the reported in-domain and OOD results.
minor comments (2)
  1. [Abstract] The abstract contains typographical errors (e.g., 'accu racy', 'or thogonal') that should be corrected for clarity.
  2. [Method] An explicit equation for the orthogonal projection operator would improve the readability of the decomposition step.
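The equation the second minor comment asks for is a standard vector projection; in the notation recoverable from Figure 3's caption it would presumably read (the paper's own equation may differ):

```latex
% Question-aligned and question-orthogonal components at layer l
% (notation inferred from the figure captions, not quoted from the paper)
h_\parallel^{(l)} = \frac{h_A^{(l)} \cdot h_Q^{(l)}}{\lVert h_Q^{(l)} \rVert^2}\, h_Q^{(l)},
\qquad
v_\perp^{(l)} = h_A^{(l)} - h_\parallel^{(l)}
```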

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [Method] The central claim that the question-orthogonal component isolates domain-agnostic factuality signals rests on an untested assumption; no analysis demonstrates that the removed question-aligned subspace correlates more strongly with domain labels than with hallucination labels (see the decomposition construction and OOD transfer claims).

    Authors: We agree that a direct quantitative comparison of correlations between the question-aligned subspace and domain versus hallucination labels would provide stronger support for the claim. The current evidence is primarily indirect via the OOD transfer gains of the orthogonal-only probe. In the revision we will add an analysis computing Pearson correlations of the removed directions with domain labels and hallucination labels across layers, and report these results in a new subsection of Section 3. revision: yes

  2. Referee: [Abstract and Experiments] Performance numbers in the abstract (e.g., best in-domain AUROC and up to 21% OOD gain on BioASQ) are stated without accompanying experimental details such as dataset sizes, number of runs, error bars, or ablation results on layer/neuron selection, undermining verifiability of the reported superiority.

    Authors: The full experimental details (dataset sizes, number of runs, standard deviations, and layer/neuron ablation tables) are already present in Section 4 and Appendix B. To improve verifiability we will revise the abstract to briefly note the number of runs and report mean AUROC with standard deviation for the key claims, while retaining the concise summary format. revision: partial

  3. Referee: [§3.2] The diversity-penalized Fisher scoring for layer selection introduces free parameters that appear tuned on evaluation data, creating a risk of circularity in the reported in-domain and OOD results.

    Authors: The diversity penalty coefficient and layer-selection threshold were chosen on a held-out validation split drawn from the in-domain training distribution, separate from both the in-domain test sets and the OOD evaluation sets. We will expand Section 3.2 to explicitly describe this validation procedure, list the exact hyperparameter values, and add a sensitivity plot showing performance variation with the penalty coefficient. revision: yes

Circularity Check

0 steps flagged

No significant circularity in QAOD derivation chain

full rationale

The paper's core construction applies standard orthogonal projection to remove a question-aligned direction from answer representations, followed by Fisher-based layer and neuron selection. These steps are presented as methodological choices whose outputs are then evaluated empirically on in-domain and OOD benchmarks. No equation or procedure reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation. The reported AUROC gains are empirical outcomes rather than tautological consequences of the decomposition definition. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Core method rests on the unproven domain assumption that question-orthogonal components isolate factuality signals; layer and neuron selection via Fisher scoring introduces fitted parameters whose values are not reported.

free parameters (1)
  • diversity-penalized Fisher scoring parameters
    Layer selection and neuron importance scores are computed from data and therefore fitted.
axioms (1)
  • domain assumption Projecting away the question-aligned direction suppresses domain-conditioned variation
    Central premise of the orthogonal decomposition step.

pith-pipeline@v0.9.0 · 5518 in / 1258 out tokens · 35274 ms · 2026-05-15T01:54:27.392639+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor
