citation dossier

Aligning large multi-modal model with robust instruction tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang · 2023 · arXiv 2306.14565

20Pith papers citing it

20reference links

cs.CVtop field · 17 papers

UNVERDICTEDtop verdict bucket · 17 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 20 reviewed papers. Its strongest current cluster is cs.CV (17 papers). The largest review-status bucket among citing papers is UNVERDICTED (17 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

cs.CV · 2026-04-28 · conditional · novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

Online Self-Calibration Against Hallucination in Vision-Language Models

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.

State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.

ReflectCAP: Detailed Image Captioning with Reflective Memory

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

cs.CV · 2023-06-23 · unverdicted · novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 5.0

A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11 · unverdicted · novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.

Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.

LLaVA-OneVision: Easy Visual Task Transfer

cs.CV · 2024-08-06 · unverdicted · novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

cs.CV · 2024-08-03 · conditional · novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

Hallucination of Multimodal Large Language Models: A Survey

cs.CV · 2024-04-29 · accept · novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

cs.CL · 2023-11-09 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Delineating Knowledge Boundaries for Honest Large Vision-Language Models

cs.CV · 2026-04-29 · unverdicted · novelty 4.0

VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

A Survey on Hallucination in Large Vision-Language Models

cs.CV · 2024-02-01 · unverdicted · novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

citing papers explorer

Showing 20 of 20 citing papers.

GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 6
GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models cs.CV · 2026-04-28 · conditional · none · ref 25
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering cs.CV · 2026-05-06 · unverdicted · none · ref 100
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
Online Self-Calibration Against Hallucination in Vision-Language Models cs.CV · 2026-05-01 · unverdicted · none · ref 18
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CV · 2026-04-29 · unverdicted · none · ref 26
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
ReflectCAP: Detailed Image Captioning with Reflective Memory cs.AI · 2026-04-14 · unverdicted · none · ref 19
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 148
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models cs.CV · 2023-06-23 · unverdicted · none · ref 29
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 21
A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models cs.AI · 2026-04-11 · unverdicted · none · ref 23
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by reinforcing visual attention.
Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction cs.CV · 2026-04-09 · unverdicted · none · ref 31
MESA reduces hallucinations in LVLMs via controlled selective latent intervention that preserves the original token distribution.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 80
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone cs.CV · 2024-08-03 · conditional · none · ref 60
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 113
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 91
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 192
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Delineating Knowledge Boundaries for Honest Large Vision-Language Models cs.CV · 2026-04-29 · unverdicted · none · ref 14
VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 86
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 60
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A Survey on Hallucination in Large Vision-Language Models cs.CV · 2024-02-01 · unverdicted · none · ref 28
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Aligning large multi-modal model with robust instruction tuning

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer