{"total":16,"items":[{"citing_arxiv_id":"2606.31054","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-30T02:46:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ADAPT reduces MLLM hallucinations 40-60% by aligning cross-attention dynamics via visual anchors, supervised inference, and preference tuning while preserving general capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29805","ref_index":78,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation","primary_cat":"cs.CV","submitted_at":"2026-06-29T05:33:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPPO is an evidence-aware preference optimization objective that contrasts faithful responses under varying visual evidence strengths to reduce hallucinations in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28401","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs","primary_cat":"cs.CV","submitted_at":"2026-06-24T11:06:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object 
HalBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":142,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"T, I (GUI) Screenshot/UI-tree to action (tap, type, drag); covers mobile and OS environments. Embodied Interaction ALFWorld [135], BridgeData V2 [136], Open X-Embodiment [137], Magma [138] T, I, V (robot) Language-conditioned manipulation from visual observations and robot states. Align Hallucination & Faithfulness LLaVA-RLHF [139], RLHF-V [140], VLFeedback [141], RLAIF-V [142], HA-DPO [143], V-DPO [144] T, I Comparative or span-level feedback to reduce visual hallucinations; AI-assisted labels. Safety Alignment SPA-VL [145], Safe RLHF-V [146] T, I Safe/unsafe response pairs under multimodal harmful prompts. 
Generation Quality Preference ImageReward [147], Pick-a-Pic [148], HPS v2 [ 149], VBench [ 150], VBench++ [151] I, V Human preference scores for aesthetics, alignment,"},{"citing_arxiv_id":"2605.19663","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-19T10:57:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PStar adaptively selects pseudocode-based reasoning strategies via a Difficulty Feature Vector to reduce hallucinations in vision-language models, reporting SOTA results on POPE and MMStar benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10622","ref_index":128,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination","primary_cat":"cs.MM","submitted_at":"2026-05-11T14:16:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04641","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering","primary_cat":"cs.CV","submitted_at":"2026-05-06T08:32:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAST reduces 
object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18512","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:06:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.10016","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training-Free Multimodal Large Language Model Orchestration","primary_cat":"cs.CL","submitted_at":"2025-08-06T16:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12455","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Object Hallucinations via Sentence-Level Early Intervention","primary_cat":"cs.CV","submitted_at":"2025-07-16T17:55:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SENTINEL reduces MLLM object 
hallucinations by over 90% via sentence-level early intervention with detector-bootstrapped preference data and C-DPO loss, outperforming prior SOTA on hallucination and capability benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.06856","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-06-07T16:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.01785","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual-RFT: Visual Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-03-03T18:16:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Internlm-math: Open math large language models toward verifiable reasoning. arXiv preprint arXiv:2402.06332, 2024. 4 [42] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RlHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In CVPR, 2024. 
4 [43] Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. RLAIF-V: Aligning mllms through open-source ai feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024. 4 [44] Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with mul-"},{"citing_arxiv_id":"2412.05271","ref_index":283,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","primary_cat":"cs.CV","submitted_at":"2024-12-06T18:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"It supports both English and Chinese versions, and we present the model's performance scores on the test set. MMBench v1.1[ 156]: Compared to MMBench, MMBench v1.1 features a refined dataset with a small number of noisy or low-quality questions removed, resulting in a subtle improvement in overall data quality. We report the model's performance on the English version of the test set. MMVet[ 283]: MMVet is a benchmark designed to assess the integrated capabilities of MLLMs on complex tasks. It evaluates six core competencies: recognition, knowledge, spatial awareness, language generation, OCR, and mathematics, across 16 integrated tasks. 
Note that VLMEvalKit uses GPT-4-Turbo as the scoring model for this benchmark, which yields slightly lower scores compared to the official evaluation server."},{"citing_arxiv_id":"2411.10442","ref_index":111,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"to complete the remaining portion of the truncated answer without access to the image input. This generated completion serves as the rejected answer for the paired sample. Experimental results in Section 5.2 demonstrate that this straightforward method achieves comparable performance in reducing hallucinations compared to the divide-and-conquer method proposed in RLAIF-V [111]. In the correctness-based pipeline, multiple solutions to each question are sampled from InternVL2 series. Solutions matching the ground truth answer are used as chosen responses, while those that do not are used as rejected responses. Additionally, we propose the MPO method. 
The key insight behind this algorithm is that an effective PO process should enable the model to learn the relative prefer-"},{"citing_arxiv_id":"2408.01800","ref_index":112,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","primary_cat":"cs.CV","submitted_at":"2024-08-03T15:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"5 outperforms GPT-4V, Gemini Pro and Qwen-VL-Max on OCRBench. It also supports high-utility functions such as table-to-markdown conversion and full OCR content transcription. These are largely attributed to the 1.8M pixel high-resolution (e.g., 1344 × 1344) image perception technique across any aspect ratios [107]. • Trustworthy Behavior. Based on the RLAIF-V [112] and RLHF-V [111] techniques that align MLLM behaviors from AI/human feedback, MiniCPM-Llama3-V 2.5 exhibits more trustworthy behaviors, achieving lower hallucination rates than GPT-4V-1106 on Object HalBench. • Multilingual Support. 
Inspired by the findings from VisCPM [41], the integration of multilingual LLM significantly alleviates the heavy reliance on multimodal training data in low-resource"},{"citing_arxiv_id":"2404.18930","ref_index":199,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hallucination of Multimodal Large Language Models: A Survey","primary_cat":"cs.CV","submitted_at":"2024-04-29T17:59:41+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Statistic Bias Frequent Objects e.g. POPE [103] Objects Occurrence e.g. LURE [224], VCD [94] Hallucination from Model (§3.2) Vision Model Information Loss e.g. HallusionBench [108], AMBER [160] Feature Bias e.g. Tong et al. [152] Language Model Parametric Knowledge e.g. VCD [94], Volcano [93] Cross-modal Interface Inferior Alignment e.g. HACL [75], Halle-Switch [199] Hallucination from Training (§3.3) Sequence Supervision e.g. MOCHa [11], OPERA [66] Visual Supervision e.g. Chen et al. [29] Human Feedback e.g. RLHF-V [193] Hallucination from Inference (§3.4) Visual Attention Deficiency e.g. OPERA [66], HaELM [161], M3ID [41] Trap Visual Tokens e.g. AvisC [170], VTI [114] Hallucination Metrics and Benchmarks (§4) Hallucination Metrics"}],"limit":50,"offset":0}