arxiv: 2604.06714 · v1 · submitted 2026-04-08 · 💻 cs.AI · cs.CL· cs.CV· cs.LG

Recognition: no theorem link

Steering the Verifiability of Multimodal AI Hallucinations

Jianhong Pang , Ruoxi Cheng , Ziyi Ye , Xingjun Ma , Zuxuan Wu , Xuanjing Huang , Yu-Gang Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.LG

keywords multimodal AIhallucinationsverifiabilityactivation spaceintervention probesobvious hallucinationselusive hallucinationsMLLMs

0 comments

The pith

Separate probes in activation space let multimodal models steer the verifiability of their hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hallucinations in multimodal AI models vary in how readily humans can detect them, with some obvious and others elusive. The paper collects thousands of human responses to label these differences and builds a dataset of obvious versus elusive cases. It then develops an intervention technique that learns distinct probes inside the model's activation space for each category. These probes allow targeted adjustments that regulate how verifiable the model's outputs become. If the method works, AI applications could be tuned for stricter or more lenient error checking based on the use case, from high-stakes tasks to everyday queries.

Core claim

The authors construct a dataset from 4,470 human responses that categorizes AI-generated hallucinations into obvious and elusive types according to human verifiability. They propose an activation-space intervention method that learns separate probes for the two types. Experiments show that obvious and elusive hallucinations trigger different probes, targeted interventions outperform general ones at regulating the matching verifiability, and simply mixing the probes produces flexible control suited to different security and usability demands.

What carries the argument

Activation-space intervention method that learns separate probes for obvious and elusive hallucinations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The probes could be extended to modulate other output properties such as confidence levels or level of detail.
Dynamic mixing during generation might allow real-time adjustment to match user-specified risk tolerance.
The approach may transfer to text-only models if similar human-labeled categories can be collected.

Load-bearing premise

Human responses provide a reliable and generalizable way to categorize hallucinations as obvious or elusive based on verifiability.

What would settle it

New human evaluations on outputs after probe application that show no measurable change in detection effort or accuracy for obvious versus elusive hallucinations relative to the unadjusted model.

read the original abstract

AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucination contents could be detected by human users(i.e., obvious hallucinations), while others are often missed or require more verification effort(i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model's verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical way to steer how detectable hallucinations are in multimodal models via separate activation probes, but the human labeling step is the weak link.

read the letter

The main takeaway is that they built a dataset of 4,470 human responses to split hallucinations into obvious ones (easy for people to catch) and elusive ones (harder to spot), then learned distinct activation-space probes for each. Targeted interventions with the matching probe improve control over the corresponding verifiability, and a linear mix of the two probes lets you adjust the overall level for different use cases. This is a shift from blanket hallucination reduction toward tunable detectability, which fits real deployment trade-offs between safety and usability. The activation approach is lightweight and the empirical edge for targeted probes over generic ones is a clear positive result on their terms. The soft spot is the foundation in human labels. The abstract gives no inter-annotator agreement numbers, no protocol details, and no checks that the same hallucination gets consistent obvious/elusive tags across users or contexts. If those labels are noisy or context-dependent, the probes risk capturing annotator artifacts rather than stable directions in the model, which would undercut both the superiority claim and the mixing controllability. Full results would also need to show that the interventions leave non-hallucinated outputs intact and do not create new problems. This is for researchers focused on hallucination management in vision-language models who want more than on/off suppression. A reader working on safety knobs or controllable generation would find the dataset and probe idea worth testing. It deserves a serious referee because the method is concrete and the question is practical, even if revisions will need to tighten the label validation and add side-effect checks. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that hallucinations in multimodal LLMs vary in verifiability (obvious vs. elusive to humans), constructs a dataset of 4,470 human responses to label them accordingly, and proposes learning separate activation-space probes for each type. Targeted interventions using these probes are shown to regulate the corresponding verifiability more effectively than alternatives, while linear mixing of the probes enables flexible control over verifiability levels for different application scenarios.

Significance. If the human labels prove stable and the probes demonstrably isolate verifiability directions without collateral effects on model capability, the approach would provide a practical, tunable mechanism for steering MLLM outputs in contexts with differing security or usability requirements, extending activation-engineering techniques to a new controllable property.

major comments (2)

[Dataset Construction] Dataset construction (human labeling of 4,470 responses): no inter-annotator agreement statistics, annotation guidelines, or cross-context consistency checks are reported. Because the probes are learned directly from these labels, high label noise would cause the probes to fit annotator-specific artifacts rather than reproducible verifiability features, directly undermining both the superiority claim for targeted interventions and the controllability of mixtures.
[Experiments / Results] Empirical results section: the abstract asserts superior performance for targeted probes and flexible control via mixing, yet the manuscript provides no ablation isolating probe specificity (e.g., effect on non-hallucinated outputs), no statistical significance tests, and no comparison against strong baselines such as random or single-probe interventions. These omissions leave the central empirical support for load-bearing claims unverified.

minor comments (2)

[Method] Clarify the precise linear-algebraic definition of the mixing operation and the loss used to train the probes; the current description leaves the intervention formula ambiguous.
[Figures] Figure captions and axis labels in the results figures should explicitly state the metric (e.g., human verifiability score or detection rate) and the number of trials per condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the empirical and methodological rigor of the work.

read point-by-point responses

Referee: [Dataset Construction] Dataset construction (human labeling of 4,470 responses): no inter-annotator agreement statistics, annotation guidelines, or cross-context consistency checks are reported. Because the probes are learned directly from these labels, high label noise would cause the probes to fit annotator-specific artifacts rather than reproducible verifiability features, directly undermining both the superiority claim for targeted interventions and the controllability of mixtures.

Authors: We agree that inter-annotator agreement statistics are essential for validating label quality. In the revised manuscript we will report Fleiss' kappa (or equivalent) computed over the multiple annotators who labeled the 4,470 responses. We will also append the complete annotation guidelines, which explicitly define obvious hallucinations as those detectable by visual inspection of the image alone and elusive hallucinations as those requiring external knowledge or verification effort. Annotations were performed under a standardized protocol with training examples and quality checks across image-question contexts; we will add a brief consistency analysis (e.g., agreement stratified by image category) to address potential context-specific artifacts. These additions directly respond to the concern that label noise could undermine the learned probes. revision: yes
Referee: [Experiments / Results] Empirical results section: the abstract asserts superior performance for targeted probes and flexible control via mixing, yet the manuscript provides no ablation isolating probe specificity (e.g., effect on non-hallucinated outputs), no statistical significance tests, and no comparison against strong baselines such as random or single-probe interventions. These omissions leave the central empirical support for load-bearing claims unverified.

Authors: We accept that the current experimental section lacks several standard controls. In the revision we will add (1) an ablation measuring probe effects on non-hallucinated outputs to demonstrate specificity, (2) statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with p-values) for all reported performance differences, and (3) explicit comparisons against random-direction interventions and single-probe baselines. These new results will be presented in an expanded results section and will directly support the claims of targeted superiority and flexible mixing control. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probe learning from external human labels

full rationale

The paper's core chain begins with an external dataset of 4,470 human responses used to label hallucinations as obvious or elusive, followed by learning separate activation-space probes and testing targeted interventions on verifiability. No step reduces by construction to its own inputs: the probes are fitted to human-provided labels rather than self-defined quantities, the claimed superiority of targeted vs. mixed interventions is evaluated empirically (not forced by the fitting procedure itself), and no self-citations or uniqueness theorems are invoked as load-bearing premises. The derivation remains self-contained against the external human-label benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; the central approach assumes human labels define meaningful verifiability categories and that activation interventions can selectively affect them. No numerical free parameters or external benchmarks are specified.

axioms (1)

domain assumption Human responses to AI hallucinations provide a consistent and generalizable basis for distinguishing obvious from elusive types.
The dataset construction and subsequent probe learning rest directly on these human judgments.

invented entities (1)

Type-specific activation-space probes for obvious and elusive hallucinations no independent evidence
purpose: To detect and intervene on distinct verifiability patterns in model activations
Probes are learned from the new dataset; no independent falsifiable evidence outside the paper is mentioned.

pith-pipeline@v0.9.0 · 5523 in / 1378 out tokens · 78181 ms · 2026-05-10T18:16:06.795689+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Membership Inference for Contrastive Pre-training Models with Text-only PII Queries
cs.CR 2026-03 unverdicted novelty 7.0

UMID infers membership in contrastive pre-training data using only text queries by performing latent inversion and comparing similarity and variability signals to synthetic gibberish references via unsupervised anomal...

Reference graph

Works this paper leans on

44 extracted references · 10 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

XiangAn,YinXie,KaichengYang,WenkangZhang,XiuweiZhao,ZhengCheng,YiruiWang,SongcenXu,Changrui Chen, ChunshengWu,HuajieTan,ChunyuanLi,JingYang,JieYu,XiyaoWang, BinQin, YumengWang,ZizhenYan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training, 2025. URLhttps://arxiv.org/abs/2509.23661

work page internal anchor Pith review arXiv 2025
[2]

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37: 136037–136083, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37: 136037–136083, 2024

2024
[3]

The internal state of an llm knows when it’s lying

Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying. InFindingsofthe Association forComputationalLinguistics: EMNLP 2023, pages 967–976, 2023

2023
[4]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXivpreprintarXiv:2404.18930, 2024

work page internal anchor Pith review arXiv 2024
[5]

Leace: Perfect linear concept erasure in closed form.AdvancesinNeuralInformationProcessingSystems, 36:66044–66063, 2023

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. Leace: Perfect linear concept erasure in closed form.AdvancesinNeuralInformationProcessingSystems, 36:66044–66063, 2023

2023
[6]

Dola: Decoding by contrasting layers improves factuality in large language models

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InThe TwelfthInternational Conference on Learning Representations
[7]

I don’t know: Explicit modeling of uncertainty with an [idk] token.AdvancesinNeuralInformationProcessingSystems, 37:10935–10958, 2024

Roi Cohen, Konstantin Dobler, Eden Biran, and Gerard de Melo. I don’t know: Explicit modeling of uncertainty with an [idk] token.AdvancesinNeuralInformationProcessingSystems, 37:10935–10958, 2024

2024
[8]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings ofthe 32nd ACMInternational Conferenceon Multimedia, pages 11198–11201, 2024

2024
[9]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

2024
[10]

Don’thallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration

ShangbinFeng,WeĳiaShi,YikeWang,WenxuanDing,VidhishaBalachandran,andYuliaTsvetkov. Don’thallucinate, abstain: Identifying llm knowledge gaps via multi-llm collaboration. InProceedingsofthe 62ndAnnualMeetingof the AssociationforComputationalLinguistics(Volume1: LongPapers), pages 14664–14690, 2024

2024
[11]

Mme: A comprehensive evaluation benchmark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. InThe Thirty-ninthAnnualConferenceon NeuralInformationProcessingSystemsDatasetsand BenchmarksTrack
[12]

Xuefeng Du, Chaowei Xiao, and Yixuan Li

ChengGao, HuiminChen, ChaojunXiao, ZhiyiChen, ZhiyuanLiu, andMaosongSun. H-neurons: Ontheexistence, impact, and origin of hallucination-associated neurons in llms.arXiv preprintarXiv:2512.01797, 2025

work page arXiv 2025
[13]

Enabling large language models to generate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, 2023

2023
[14]

Anah-v2: Scalinganalyticalhallucination annotation of large language models.AdvancesinNeuralInformationProcessingSystems, 37:60012–60039, 2024

YuzheGu,ZiweiJi,WenweiZhang,ChengqiLyu,DahuaLin,andKaiChen. Anah-v2: Scalinganalyticalhallucination annotation of large language models.AdvancesinNeuralInformationProcessingSystems, 37:60012–60039, 2024

2024
[15]

Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xĳun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, page...

2024
[16]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACMTransactionson InformationSystems, 43(2):1–55, 2025

Lei Huang, Weĳiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACMTransactionson InformationSystems, 43(2):1–55, 2025. 13

2025
[17]

Survey of hallucination in natural language generation.ACMcomputingsurveys, 55(12):1–38, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACMcomputingsurveys, 55(12):1–38, 2023

2023
[18]

Calibrating verbal uncertainty as a linear feature to reduce hallucinations

Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations. InProceedings of the 2025Conferenceon EmpiricalMethods inNaturalLanguageProcessing, pages 3769–3793, 2025

2025
[19]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedingsofthe 2023conferenceonempiricalmethods innaturallanguage processing, pages 292–305, 2023

2023
[21]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models.arXivpreprintarXiv:2402.00253, 2024

work page internal anchor Pith review arXiv 2024
[22]

Visual instruction tuning.Advances in neural informationprocessingsystems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural informationprocessingsystems, 36:34892–34916, 2023

2023
[23]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computervision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computervision, pages 216–233. Springer, 2024

2024
[24]

Fine-grained hallucination detection and editing for language models.arXivpreprintarXiv:2401.06855, 2024

Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Graham Neubig, Yulia Tsvetkov, and Hannaneh Hajishirzi. Fine-grained hallucination detection and editing for language models.arXivpreprintarXiv:2401.06855, 2024

work page arXiv 2024
[25]

Controlling hallucinations at word level in data-to-text generation.DataMiningandKnowledgeDiscovery, 36(1): 318–354, 2022

Clément Rebuffel, Marco Roberti, Laure Soulier, Geoffrey Scoutheeten, Rossella Cancelliere, and Patrick Gallinari. Controlling hallucinations at word level in data-to-text generation.DataMiningandKnowledgeDiscovery, 36(1): 318–354, 2022

2022
[26]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedingsofthe IEEE/CVFconferenceoncomputervisionand pattern recognition, pages 8317–8326, 2019

2019
[27]

On early detection of hallucinations in factual question answering

Ben Snyder, Marius Moisescu, and Muhammad Bilal Zafar. On early detection of hallucinations in factual question answering. InProceedingsofthe30thACMSIGKDDConferenceonKnowledgeDiscoveryandDataMining,pages 2721–2732, 2024

2024
[28]

Whatlargelanguagemodelsknowandwhatpeoplethinktheyknow

Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and PadhraicSmyth. Whatlargelanguagemodelsknowandwhatpeoplethinktheyknow. NatureMachineIntelligence, 7(2):221–231, 2025

2025
[29]

Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention

Jingran Su, Jingfan Chen, Hongxin Li, Yuntao Chen, Li Qing, and Zhaoxiang Zhang. Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12964–12974, 2025

2025
[30]

Towards verifiable text generation with evolving memory and self-reflection

Hao Sun, Hengyi Cai, Bo Wang, Yingyan Hou, Xiaochi Wei, Shuaiqiang Wang, Yan Zhang, and Dawei Yin. Towards verifiable text generation with evolving memory and self-reflection. InProceedings of the 2024 Conference on EmpiricalMethods in NaturalLanguageProcessing, pages 8211–8227, 2024

2024
[31]

Cross-layer attention probing for fine-grained hallucination detection.arXivpreprintarXiv:2509.09700, 2025

Malavika Suresh, Rahaf Aljundi, Ikechukwu Nkisi-Orji, and Nirmalie Wiratunga. Cross-layer attention probing for fine-grained hallucination detection.arXivpreprintarXiv:2509.09700, 2025

work page arXiv 2025
[32]

Qwen2.5-vl, January 2025

Qwen Team. Qwen2.5-vl, January 2025. URLhttps://qwenlm.github.io/blog/qwen2.5-vl/

2025
[33]

Uncertainty-based abstention in llms improves safety and reduces hallucinations.arXiv preprint arXiv:2404.10960, 2024

Christian Tomani, Kamalika Chaudhuri, Ivan Evtimov, Daniel Cremers, and Mark Ibrahim. Uncertainty-based abstention in llms improves safety and reduces hallucinations.arXiv preprintarXiv:2404.10960, 2024. 14

work page arXiv 2024
[34]

A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation,

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation.arXiv preprint arXiv:2307.03987, 2023

work page arXiv 2023
[35]

Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, JiZhang,etal. Amber: Anllm-freemulti-dimensionalbenchmarkformllmshallucinationevaluation. arXivpreprint arXiv:2311.07397, 2023

work page arXiv 2023
[36]

Reft: Representationfinetuningforlanguagemodels

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and ChristopherPotts. Reft: Representationfinetuningforlanguagemodels. AdvancesinNeuralInformationProcessing Systems, 37:63908–63962, 2024

2024
[37]

On hallucination and predictive uncertainty in conditional language generation

Yĳun Xiao and William Yang Wang. On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2734–2744, 2021

2021
[38]

Region-based cluster discrimination for visual representation learning

Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, et al. Region-based cluster discrimination for visual representation learning. InProceedings of the IEEE/CVF International Conferenceon ComputerVision, pages 1793–1803, 2025

2025
[39]

A new benchmark and reverse validation method for passage- level hallucination detection

Shiping Yang, Renliang Sun, and Xiaojun Wan. A new benchmark and reverse validation method for passage- level hallucination detection. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 3898–3908, 2023

2023
[40]

Woodpecker: Hallucination correction for multimodal large language models.Science China Information Sciences, 67(12):220105, 2024

ShukangYin,ChaoyouFu,SiruiZhao,TongXu,HaoWang,DianboSui,YunhangShen,KeLi,XingSun,andEnhong Chen. Woodpecker: Hallucination correction for multimodal large language models.Science China Information Sciences, 67(12):220105, 2024

2024
[41]

Mmmu-pro: Amorerobustmulti-disciplinemultimodalunderstandingbenchmark

XiangYue,TianyuZheng,YuanshengNi,YuboWang,KaiZhang,ShengbangTong,YuxuanSun,BotaoYu,GeZhang, HuanSun,etal. Mmmu-pro: Amorerobustmulti-disciplinemultimodalunderstandingbenchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

2025
[42]

Enhancing uncertainty-based hallucination detection with stronger focus

Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. Enhancing uncertainty-based hallucination detection with stronger focus. InProceedings of the 2023 Conferenceon EmpiricalMethods inNaturalLanguageProcessing, pages 915–932, 2023

2023
[43]

unclear” or “I don’t know

Kaitlyn Zhou, Jena Hwang, Xiang Ren, and Maarten Sap. Relying on the unreliable: The impact of language models’ reluctancetoexpressuncertainty. In Proceedingsofthe62ndAnnualMeetingoftheAssociationforComputational Linguistics(Volume1: Long Papers), pages 3623–3643, 2024. 15 Appendix A Prompts A.1 Prompt Construction for Description Detailed prompt template...

2024
[44]

Table 6Selected intervention coefficients for OHI and EHI

Inotherwords,wetreattheacceptableinterventionstrengthasastablerangeratherthanauniquebestpoint. Table 6Selected intervention coefficients for OHI and EHI. Model𝛼 oh 𝛼eh Qwen2.5-VL-3B 0.90 0.90 Qwen2.5-VL-7B 0.80 0.70 LLaVA-OneVision-1.5-8B 0.80 0.80 For Qwen2.5-VL-3B, the validation curves in Fig- ure9becomerelativelyflatinthehigh-performing region, and we...