{"total":20,"items":[{"citing_arxiv_id":"2605.04641","ref_index":100,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering","primary_cat":"cs.CV","submitted_at":"2026-05-06T08:32:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08145","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-03T06:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01733","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-03T06:09:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00323","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Online Self-Calibration Against Hallucination in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-01T01:03:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26614","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading","primary_cat":"cs.CV","submitted_at":"2026-04-29T12:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26419","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Delineating Knowledge Boundaries for Honest Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-29T08:29:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25642","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-28T13:42:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12357","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReflectCAP: Detailed Image Captioning with Reflective Memory","primary_cat":"cs.AI","submitted_at":"2026-04-14T06:47:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10219","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-11T13:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07914","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction","primary_cat":"cs.CV","submitted_at":"2026-04-09T07:31:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token 
distribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.13106","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-01-22T18:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Table 3 Data mixture in massive multi-task fine-tuning stage. Task Dataset Amount Image & Text Data General LLaVA-SFT-665K [38], LLaVA-OV-SI [29], Cambrian-cleaned [39], Pixmo (docs, cap, points, cap-qa, ask-model-anything) [35] 9.87M Document DocVQA [40], Docmatix [41] 1.31M Chart/Figure ChartQA [42], MMC_Instruction [83], DVQA [84], LRV_Instruction [85], Chart- Gemma [86], InfoVQA [87], PlotQA [88] 1.00M OCR MultiUI [89], in-house data 0.83M Grounding RefCoco [90], VCR [91], in-house data 0.50M Multi-Image Demon-Full [92], Contrastive_Caption [93] 0.41M Text-only Magpie [94], Magpie-Pro [94], Synthia [95], Infinity-Instruct-subjective [82], Numina- Math [96] 2.21M Video & Text Data General LLaVA-Video-178K [25], ShareGPT4o-Video [28], FineVideo [97], CinePile [98],"},{"citing_arxiv_id":"2412.05271","ref_index":148,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"GQA [98], OKVQA [178], A-OKVQA [205], Visual7W [317], VisText [226], VSR [147], TallyQA [2],General QA Objects365-YorN [208], IconQA [167], Stanford40 [273], VisDial [51], VQAv2 [74], Hateful-Memes [111] MA VIS [300], GeomVerse [107], MetaMath-Rendered [281], MapQA [23], GeoQA+ [20], Geometry3K [164],Mathematics UniGeo [26], GEOS [206], CLEVR-Math [144] ChartQA [181], PlotQA [187], FigureQA [105], LRV-Instruction [148], ArxivQA [132], MMC-Inst [149], TabMWP [166], DVQA [104], UniChart [182], SimChart9K [263], Chart2Text [191], FinTabNet [312],Chart SciTSR [39], Synthetic Chart2Markdown LaionCOCO-OCR [204], Wukong-OCR [75], ParsynthOCR [89], SynthDoG-EN [112], SynthDoG-ZH [112], SynthDoG-RU [112], SynthDoG-JP [112], SynthDoG-KO [112], IAM [180], EST-VQA [253], ST-VQA [17],"},{"citing_arxiv_id":"2408.03326","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaVA-OneVision: Easy Visual Task Transfer","primary_cat":"cs.CV","submitted_at":"2024-08-06T17:59:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer 
capabilities.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"[78] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. 37, 39 27 [79] Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transac- tions of the Association for Computational Linguistics, 2023. 39 [80] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 39 [81] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024. 1, 3, 6, 37 [82] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee."},{"citing_arxiv_id":"2408.01800","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","primary_cat":"cs.CV","submitted_at":"2024-08-03T15:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.18930","ref_index":113,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","submitted_at":"2024-04-29T17:59:41+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"BehaviorMOCHa [11] EMNLP'24 Synthetic 2,000 Gen OpenCHAIR [11] ✓ ✓ ✗ ✗CorrelationQA [55] arXiv'24 Feb. Synthetic 7,308 Dis Acc/AccDrop ✗ ✗ ✗ Model BiasVQAv2-IDK [17] ICASSP'24 VQAv2 [50] 6,624 Dis Acc ✗ ✗ ✗ IK [17]MHaluBench [25] ACL'24 MSCOCO [105] 1,860 Gen Acc/P/R/F ✓ ✓ ✗ T2IVHTest [67] ACL'24 MSCOCO [105] 1,200 Dis & Gen Acc ✓ ✓ ✗ ✓ Hal-Eavl [74] MM'24 MSCOCO [105] &LAION [142] 10,000 Dis & GenAcc/P/R/F &LLM Assessment✓ ✓ ✓ Obj. Event PhD [113] arXiv'24 Mar. TDIUC [80] & AIGC 102,564 Dis PhD Index✓ ✓ ✓ SentimentTHRONE [82] CVPR'24 MSCOCO [105] 5,000, Gen P/R/F ✓ ✗ ✗ ✗BEAF [186] ECCV'24 MSCOCO [105] 26,118 Dis TU/IG/SB/ID ✓ ✗ ✗ ✗ ROPE [24] NeurIPS'24 MSCOCO [105] &ADE20k [218] 5,000 Dis Acc ✓ ✗ ✗ Multi Obj. LongHalQA [132] arXiv'24 Oct.VisualGenome [89] &Objects365 [145]6,485 Dis & Gen Acc ✓ ✓ ✓ Obj."},{"citing_arxiv_id":"2404.16821","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Far Are We to GPT-4V? 
Closing the Gap to Commercial Multimodal Models with Open-Source Suites","primary_cat":"cs.CV","submitted_at":"2024-04-25T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.00253","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-02-01T00:33:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.14238","ref_index":91,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.05232","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions","primary_cat":"cs.CL","submitted_at":"2023-11-09T09:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"valuable insights, our survey seeks to delineate their distinct contributions and the comprehensive scope they encompass. Ji et al. [136] primarily shed light on hallucinations in pre-trained models for NLG tasks, leaving LLMs outside their discussion purview. Tonmoy et al. [298] mainly focused on discussing the mitigation strategies combating LLM hallucinations. Besides, Liu et al. [192] took a broader view of LLM trustworthiness without delving into specific hallucination phenomena, whereas Wang et al. [312] provided an in-depth look at factuality in LLMs. 
However, our work nar- rows down to a critical subset of trustworthiness challenges, specifically addressing factuality and extending the discussion to include faithfulness hallucinations."},{"citing_arxiv_id":"2306.13394","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T09:22:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}
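
The payload above is a paginated citation listing: top-level total/limit/offset plus an items array whose entries carry per-paper metadata (verdict, novelty_score, and the optional context_* fields). Below is a minimal consumer sketch in Python, assuming the response has been saved locally as citations.json; the filename and the sorting and filtering choices are illustrative assumptions, while the field names are taken directly from the payload itself.

import json

# Load the citation-listing payload shown above (assumed saved as "citations.json").
with open("citations.json") as f:
    page = json.load(f)

# "total" counts all matches; "limit"/"offset" indicate pagination, so one
# response holds at most "limit" of the "items".
print(f"{page['total']} citing papers (this page: {len(page['items'])})")

# Rank items by the reported novelty_score, highest first.
for item in sorted(page["items"], key=lambda it: it["novelty_score"], reverse=True):
    print(f"{item['citing_arxiv_id']}  novelty={item['novelty_score']:.1f}  "
          f"verdict={item['verdict']}/{item['verdict_confidence']}  "
          f"{item['paper_title']}")

# context_text is null whenever context_count is 0, so guard before reading it.
with_context = [it for it in page["items"] if it["context_count"] > 0]
print(f"{len(with_context)} of {len(page['items'])} items include a context snippet")

Nothing in the sketch assumes fields beyond those visible in the response; whether verdict or top_context_role are closed enumerations, and how pagination behaves past the first page, would need the API's own documentation.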