Recognition: no theorem link
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Pith reviewed 2026-05-15 01:54 UTC · model grok-4.3
The pith
Projecting away the question direction from answer representations isolates reliable factuality signals for hallucination detection in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QAOD projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. Layers are selected via diversity-penalized Fisher scoring, and discriminative neurons via Fisher importance, to isolate the informative signals. The joint probe that pairs the orthogonal component with question context maximizes in-domain discriminability, while the orthogonal component alone preserves domain-agnostic factuality signals for robust OOD transfer.
What carries the argument
The question-orthogonal component of the answer representation, obtained by projecting away the question-aligned direction.
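A minimal sketch of that decomposition, assuming mean-pooled hidden states stand in for the question and answer representations (the paper's exact pooling and layer choice are not specified here):

```python
import numpy as np

def question_orthogonal(answer_vec: np.ndarray, question_vec: np.ndarray) -> np.ndarray:
    """Remove the question-aligned direction from an answer representation
    via rank-one rejection: a_orth = a - (a . q_hat) * q_hat."""
    q_hat = question_vec / (np.linalg.norm(question_vec) + 1e-8)
    return answer_vec - (answer_vec @ q_hat) * q_hat

# Toy check with random stand-ins for pooled hidden states.
rng = np.random.default_rng(0)
q = rng.normal(size=4096)   # question representation (hypothetical width)
a = rng.normal(size=4096)   # answer representation
a_orth = question_orthogonal(a, q)
print(abs(a_orth @ q))      # ~0: the question direction has been removed
```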
If this is right
- The joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs.
- The orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ.
- This is achieved at under 25% of the generation cost of consistency methods.
- Layer selection via diversity-penalized Fisher scoring and neuron selection via Fisher importance identify the informative signals (sketched below).
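A minimal sketch of Fisher-based selection, assuming the standard two-class Fisher score per feature; the paper's exact diversity penalty is not reproduced here, so the redundancy term below is a placeholder assumption:

```python
import numpy as np

def fisher_scores(feats: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Two-class Fisher score per feature: (mu1 - mu0)^2 / (var1 + var0)."""
    f0, f1 = feats[labels == 0], feats[labels == 1]
    num = (f1.mean(axis=0) - f0.mean(axis=0)) ** 2
    den = f1.var(axis=0) + f0.var(axis=0) + 1e-8
    return num / den

def select_layers(layer_feats: list[np.ndarray], labels: np.ndarray,
                  k: int = 3, penalty: float = 0.1) -> list[int]:
    """Greedy layer selection: mean Fisher score minus a redundancy penalty
    (cosine similarity of mean activations to layers already chosen). The
    penalty form is a placeholder; the paper's diversity term may differ."""
    base = [fisher_scores(f, labels).mean() for f in layer_feats]
    means = [f.mean(axis=0) for f in layer_feats]

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

    chosen: list[int] = []
    for _ in range(k):
        best_i, best_s = -1, -np.inf
        for i, s in enumerate(base):
            if i in chosen:
                continue
            red = max((cos(means[i], means[j]) for j in chosen), default=0.0)
            if s - penalty * red > best_s:
                best_i, best_s = i, s - penalty * red
        chosen.append(best_i)
    return chosen
```

Neuron selection would then presumably keep the top-scoring features of the chosen layers by the same Fisher score.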
Where Pith is reading between the lines
- This decomposition suggests that much of the domain shift in hallucination signals aligns with question features in the representation space.
- The method could be extended to other single-pass detection tasks where separating context from content is useful.
- Further gains might come from applying similar orthogonal projections in multimodal or multi-turn settings.
Load-bearing premise
Removing the question-aligned direction from the answer representation suppresses domain-conditioned variation and isolates reliable factuality signals usable for both in-domain and OOD detection.
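A minimal sketch of the two probing strategies built on that premise, assuming simple logistic-regression probes (the paper's probe architecture is not specified here): the joint probe concatenates the orthogonal component with the question representation, while the OOD probe uses the orthogonal component alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def orth(a: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reject the question direction q from the answer representation a."""
    return a - (a @ q) / (q @ q + 1e-8) * q

# Hypothetical pooled representations and binary labels (1 = hallucinated).
rng = np.random.default_rng(1)
Q = rng.normal(size=(500, 256))
A = rng.normal(size=(500, 256))
y = rng.integers(0, 2, size=500)
A_orth = np.stack([orth(a, q) for a, q in zip(A, Q)])

# Joint probe (in-domain): orthogonal component plus question context.
joint_probe = LogisticRegression(max_iter=1000).fit(np.hstack([A_orth, Q]), y)
# Orthogonal-only probe (OOD transfer): domain-agnostic signal alone.
orth_probe = LogisticRegression(max_iter=1000).fit(A_orth, y)
```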
What would settle it
Running the orthogonal-only probe on a held-out domain shift dataset such as BioASQ and observing AUROC no better than the best white-box baseline would falsify the OOD robustness claim.
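As a concrete form of that check, assuming binary hallucination labels on a held-out domain-shifted split and scores from the two detectors (all names hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_claim_holds(y_true: np.ndarray, orth_scores: np.ndarray,
                    baseline_scores: np.ndarray) -> bool:
    """True if the orthogonal-only probe beats the best white-box baseline
    on the domain-shifted split; False would falsify the OOD claim."""
    return roc_auc_score(y_true, orth_scores) > roc_auc_score(y_true, baseline_scores)
```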
Figures
Original abstract
Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces QAOD, a single-pass hallucination detection framework for LLMs that performs orthogonal decomposition on answer hidden states to remove question-aligned directions, thereby suppressing domain-specific variation. It uses diversity-penalized Fisher scoring to select layers and Fisher importance for neurons, and proposes a joint probe (orthogonal component + question context) for in-domain detection and an orthogonal-only probe for OOD generalization. The authors claim that the joint probe achieves the best in-domain AUROC across evaluated models and datasets, while the orthogonal-only probe provides the strongest OOD transfer, outperforming white-box baselines by up to 21% on BioASQ with less than 25% of the generation cost.
Significance. If the empirical results are robust and the core assumption holds, this approach offers a promising balance of efficiency and robustness for hallucination detection, particularly in cross-domain settings where consistency-based methods are computationally expensive. The use of orthogonal projection to isolate factuality signals could advance white-box probing techniques.
major comments (3)
- [Method] The central claim that the question-orthogonal component isolates domain-agnostic factuality signals rests on an untested assumption; no analysis demonstrates that the removed question-aligned subspace correlates more strongly with domain labels than with hallucination labels (see the decomposition construction and OOD transfer claims).
- [Abstract and Experiments] Performance numbers in the abstract (e.g., best in-domain AUROC and up to 21% OOD gain on BioASQ) are stated without accompanying experimental details such as dataset sizes, number of runs, error bars, or ablation results on layer/neuron selection, undermining verifiability of the reported superiority.
- [§3.2] The diversity-penalized Fisher scoring for layer selection introduces free parameters that appear tuned on evaluation data, creating a risk of circularity in the reported in-domain and OOD results.
minor comments (2)
- [Abstract] The abstract contains typographical errors (e.g., 'accu racy', 'or thogonal') that should be corrected for clarity.
- [Method] An explicit equation for the orthogonal projection operator would improve the readability of the decomposition step.
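For concreteness, the projection the second minor comment requests would presumably take the standard rank-one form (notation assumed here, not drawn from the paper): with question representation q and answer representation a,

```latex
\hat{q} = \frac{q}{\lVert q \rVert}, \qquad
a_{\perp} = \bigl(I - \hat{q}\hat{q}^{\top}\bigr)\,a
          = a - \bigl(\hat{q}^{\top} a\bigr)\,\hat{q},
\qquad \hat{q}^{\top} a_{\perp} = 0.
```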
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Method] The central claim that the question-orthogonal component isolates domain-agnostic factuality signals rests on an untested assumption; no analysis demonstrates that the removed question-aligned subspace correlates more strongly with domain labels than with hallucination labels (see the decomposition construction and OOD transfer claims).
  Authors: We agree that a direct quantitative comparison of correlations between the question-aligned subspace and domain versus hallucination labels would provide stronger support for the claim. The current evidence is primarily indirect, via the OOD transfer gains of the orthogonal-only probe. In the revision we will add an analysis computing Pearson correlations of the removed directions with domain labels and hallucination labels across layers, and report these results in a new subsection of Section 3 (a sketch of this analysis follows these responses). revision: yes
- Referee: [Abstract and Experiments] Performance numbers in the abstract (e.g., best in-domain AUROC and up to 21% OOD gain on BioASQ) are stated without accompanying experimental details such as dataset sizes, number of runs, error bars, or ablation results on layer/neuron selection, undermining verifiability of the reported superiority.
  Authors: The full experimental details (dataset sizes, number of runs, standard deviations, and layer/neuron ablation tables) are already present in Section 4 and Appendix B. To improve verifiability we will revise the abstract to briefly note the number of runs and report mean AUROC with standard deviation for the key claims, while retaining the concise summary format. revision: partial
- Referee: [§3.2] The diversity-penalized Fisher scoring for layer selection introduces free parameters that appear tuned on evaluation data, creating a risk of circularity in the reported in-domain and OOD results.
  Authors: The diversity penalty coefficient and layer-selection threshold were chosen on a held-out validation split drawn from the in-domain training distribution, separate from both the in-domain test sets and the OOD evaluation sets. We will expand Section 3.2 to explicitly describe this validation procedure, list the exact hyperparameter values, and add a sensitivity plot showing performance variation with the penalty coefficient. revision: yes
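A minimal sketch of the correlation analysis promised in the first response, assuming the scalar question-aligned coefficient per example is correlated against binary domain and hallucination labels (names hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

def removed_component_correlations(A: np.ndarray, Q: np.ndarray,
                                   domain: np.ndarray, halluc: np.ndarray):
    """Correlate the per-example question-aligned coefficient (the component
    QAOD removes) with domain labels vs hallucination labels; the paper's
    premise predicts the former correlation dominates."""
    q_hat = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8)
    coeff = np.einsum("ij,ij->i", A, q_hat)  # question-aligned coefficient
    r_domain, _ = pearsonr(coeff, domain)
    r_halluc, _ = pearsonr(coeff, halluc)
    return r_domain, r_halluc
```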
Circularity Check
No significant circularity in QAOD derivation chain
Full rationale
The paper's core construction applies standard orthogonal projection to remove a question-aligned direction from answer representations, followed by Fisher-based layer and neuron selection. These steps are presented as methodological choices whose outputs are then evaluated empirically on in-domain and OOD benchmarks. No equation or procedure reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation. The reported AUROC gains are empirical outcomes rather than tautological consequences of the decomposition definition. The framework remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- diversity-penalized Fisher scoring parameters
axioms (1)
- domain assumption: Projecting away the question-aligned direction suppresses domain-conditioned variation