arxiv: 2305.10355 · v3 · submitted 2023-05-17 · 💻 cs.CV · cs.CL · cs.MM

Evaluating Object Hallucination in Large Vision-Language Models

Jinpeng Wang, Ji-Rong Wen, Kun Zhou, Wayne Xin Zhao, Yifan Du, Yifan Li

Pith reviewed 2026-05-11 13:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.MM

keywords object hallucination · large vision-language models · evaluation method · POPE · visual instructions · multimodal generation · image captioning

The pith

Large vision-language models often describe objects absent from the given image, especially those frequent in instructions or co-occurring with visible items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines object hallucination in large vision-language models (LVLMs), which are built by pairing strong language models with vision encoders. Experiments across representative models reveal that these systems generate image descriptions containing objects not present in the actual images. The authors observe that visual instructions bias the output toward objects that appear often in the instruction data or alongside objects that are truly visible. Existing evaluation approaches vary with input phrasing and output style, so the paper introduces POPE, a polling-based query technique that measures hallucination more consistently across models.

Core claim

Large vision-language models suffer from severe object hallucination: they generate objects that are inconsistent with the target images. Objects that frequently occur in the visual instructions or co-occur with the image objects are especially prone to being hallucinated. Existing evaluation methods can be affected by the input instructions and generation styles of LVLMs; the authors therefore propose POPE, a polling-based query method that evaluates object hallucination in a more stable and flexible way.
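The frequency and co-occurrence part of this claim can be made concrete with a small counting sketch. The annotation format (one set of object labels per image), the toy data, and the helper names below are assumptions for illustration, not the paper's code.

```python
from collections import Counter
from itertools import combinations


def object_frequency(annotations):
    """Count in how many images each object label appears."""
    freq = Counter()
    for objects in annotations:
        freq.update(set(objects))
    return freq


def cooccurrence_counts(annotations):
    """Count how often each unordered pair of labels shares an image."""
    pairs = Counter()
    for objects in annotations:
        for a, b in combinations(sorted(set(objects)), 2):
            pairs[(a, b)] += 1
    return pairs


def top_cooccurring(pairs, target, k=3):
    """Return the labels that most often appear together with `target`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == target:
            scores[b] += n
        elif b == target:
            scores[a] += n
    return [label for label, _ in scores.most_common(k)]


# Hypothetical toy annotations: one set of object labels per image.
annotations = [
    {"person", "dining table", "cup"},
    {"person", "dining table", "fork"},
    {"person", "surfboard"},
    {"dining table", "cup"},
]
freq = object_frequency(annotations)
pairs = cooccurrence_counts(annotations)
print(freq["person"], freq["dining table"])
print(top_cooccurring(pairs, "dining table", k=2))
```

Under the paper's claim, labels ranked highly by either count would be the ones to check first for hallucination when they are absent from an image.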

What carries the argument

POPE, a polling-based query method that asks the model yes/no questions about the presence of candidate objects in a fixed polling format to measure hallucination rates.
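A minimal sketch of what such polling could look like follows. The question template, the random sampling of absent objects, the answer parser, and the choice of metrics (standard binary-classification scores plus the share of "yes" answers) are assumptions made for illustration; the summary above only states that the model is asked yes/no questions about candidate objects. The `always_yes_model` stub stands in for a real LVLM call.

```python
import random

# Assumed question template; the source only says "yes/no questions".
PROMPT = "Is there a {obj} in the image?"


def build_poll(present, vocabulary, n_negatives, rng):
    """Probe every present object ('yes') and a sample of absent ones ('no')."""
    absent = sorted(set(vocabulary) - set(present))
    negatives = rng.sample(absent, min(n_negatives, len(absent)))
    poll = [(PROMPT.format(obj=obj), "yes") for obj in sorted(present)]
    poll += [(PROMPT.format(obj=obj), "no") for obj in negatives]
    return poll


def parse_answer(reply):
    """Map a free-text reply to yes/no; anything not starting with 'yes' is 'no'."""
    return "yes" if reply.strip().lower().startswith("yes") else "no"


def score(labels, answers):
    """Binary-classification metrics over the poll, plus the yes-answer ratio."""
    pairs = list(zip(labels, answers))
    tp = sum(l == "yes" and a == "yes" for l, a in pairs)
    fp = sum(l == "no" and a == "yes" for l, a in pairs)
    tn = sum(l == "no" and a == "no" for l, a in pairs)
    fn = sum(l == "yes" and a == "no" for l, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }


def always_yes_model(image, question):
    """Hypothetical stand-in for an LVLM that agrees with every question."""
    return "Yes, there is."


rng = random.Random(0)
vocabulary = ["person", "dog", "car", "chair", "cup", "bench", "kite", "boat"]
poll = build_poll({"person", "dog", "car"}, vocabulary, n_negatives=3, rng=rng)
labels = [label for _, label in poll]
answers = [parse_answer(always_yes_model(None, q)) for q, _ in poll]
metrics = score(labels, answers)
print(metrics)
```

On a balanced poll, the always-yes stub reaches perfect recall but only 50% accuracy with a yes ratio of 1.0, which is why a yes-answer ratio is worth reporting alongside the other scores.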

If this is right

Visual instructions should be designed to minimize exposure to frequent or co-occurring objects to lower hallucination rates.
POPE allows consistent ranking of different LVLMs on hallucination without dependence on their particular generation styles.
Models will continue to favor hallucinated objects that match patterns in their training instructions unless those patterns are altered.
Improved evaluation reveals specific objects most likely to be invented, guiding targeted fixes in training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hallucination may arise when language-model priors about common object co-occurrences override the actual visual signal.
POPE could be extended to probe other hallucination types such as attributes or relations beyond objects.
Widespread use of POPE on new models would let researchers track whether scaling or new training techniques actually reduce the problem.
If polling reveals systematic over-generation of certain object classes, retraining with balanced negative examples might help.

Load-bearing premise

The selected representative LVLMs and visual instruction datasets are sufficiently typical of the broader class of models, and the polling queries in POPE do not introduce new systematic biases in measuring hallucination.

What would settle it

Run POPE and prior evaluation methods on the same set of model outputs, then compare both against human judgments of object presence in the images; if POPE scores remain stable while prior scores shift with instruction wording, the claim holds.

Original abstract

Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LVLMs often hallucinate objects, especially frequent or co-occurring ones. POPE polling looks like a practical way to measure this, but it may not fully match free-form generation behavior.

LVLMs hallucinate objects more than we want, and this paper documents that across a few models while linking it to how often things appear in training instructions or alongside real objects in images. They also introduce POPE, which polls the model with yes/no questions about specific objects to get a cleaner read on the issue.

The work is new in being the first to focus systematically on object hallucination in these models. It does a decent job showing the problem is widespread and that frequency and co-occurrence matter. Releasing code and data is helpful too, and the idea of using polling to avoid generation-style biases makes sense on the surface.

Where it could be stronger is on the claim that POPE is clearly better. The stress test raises a good point: direct questions might trigger different behavior than open-ended descriptions, like yes-bias from instruction tuning. If the paper doesn't have side-by-side comparisons that control for that, the superiority might be overstated. Also, the abstract is light on exact sample sizes, how they chose the objects for polling, and any stats on the differences they report. That leaves the central findings plausible but not fully locked down yet.

This paper is for people building or evaluating vision-language models who need practical ways to measure reliability. A reader interested in benchmarks or multimodal safety will find it worth reading. It has enough substance and a clear contribution that it should go to peer review rather than get desk rejected. I'd recommend sending it out for review, with the expectation that reviewers will ask for more details on the POPE validation.

Referee Report

3 major / 3 minor

Summary. The paper conducts the first systematic study of object hallucination in large vision-language models (LVLMs). Experiments on representative LVLMs show severe hallucination, with objects that are frequent in visual instructions or that co-occur with image objects being especially prone to hallucination. Existing evaluation methods are critiqued for sensitivity to instructions and generation style, leading to the proposal of POPE, a polling-based yes/no query method claimed to offer more stable and flexible evaluation. Code and data are released publicly.

Significance. If the central empirical patterns and POPE evaluation hold, the work is significant for multimodal AI research: it quantifies a reliability issue in LVLMs that affects downstream tasks such as captioning and VQA, identifies actionable instruction-related biases, and supplies a practical polling protocol plus public resources for reproducible benchmarking. The explicit release of code and data is a clear strength that supports follow-on mitigation studies.

major comments (3)

[§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.
[§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.
[§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.

minor comments (3)

[Abstract] Abstract: Key quantitative findings (e.g., hallucination percentages per model) are omitted; adding one or two headline numbers would improve clarity.
Figure captions and tables: Ensure all axes and legends explicitly label hallucination rate versus object frequency or co-occurrence to avoid ambiguity in interpreting the reported patterns.
Notation: Define 'visual instructions' and 'co-occurrence' operationally in the main text on first use, as these terms are central to the claimed patterns.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our empirical findings and the justification for POPE. We address each major comment below and commit to revisions that strengthen the quantitative rigor and validation of our claims without altering the core contributions.

Referee: [§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.

Authors: We agree that explicit reporting of sample sizes, prompt controls, and uncertainty estimates is necessary for verifiability. The experiments used the full COCO val2014 set (approximately 40k images) for the primary frequency and co-occurrence analyses across models, with 5k-image subsets for efficiency in some LVLM evaluations and 1k random samples for instruction-variation ablations; all queries per image were fixed to the same template to control prompt variation. In the revision we will add a dedicated table listing exact image counts and query counts per model and dataset, state that a single fixed prompt template was used across all models for the main results, and report bootstrap 95% confidence intervals on the reported hallucination percentages and co-occurrence correlations to substantiate the severity claims. revision: yes
Referee: [§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.

Authors: The design of POPE deliberately decouples object hallucination measurement from open-ended generation style by using balanced yes/no queries, but we acknowledge that we did not present a direct quantitative link showing that the same frequency and co-occurrence biases appear under POPE. In the revised manuscript we will add a new analysis that computes per-object hallucination rates under both free-form captioning and POPE on the same set of images and models, reports Pearson correlations between the two, and includes an ablation that balances positive/negative object queries to quantify any yes-bias. This will either confirm that POPE reproduces the key patterns or allow us to clarify the precise relationship between the two evaluation regimes. revision: yes
Referee: [§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.

Authors: We presented qualitative evidence that prior metrics vary with instruction phrasing and generation length while POPE remains consistent, but we did not supply a quantitative stability metric such as variance across prompt variants. We will revise §4.2 to include a controlled ablation that applies five distinct prompt phrasings to the same images and models, computes the standard deviation of hallucination rates for POPE versus CHAIR and other baselines, and reports these variance numbers together with the original stability claims. This will provide the requested quantitative support for reduced sensitivity. revision: yes
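The three analyses promised in these responses (bootstrap confidence intervals, a correlation between free-form and polling rates, and the spread of scores across prompt phrasings) can be sketched with the standard library alone. All numbers below are made-up toy values used only to exercise the functions; they are not results from the paper, and the function names are hypothetical.

```python
import random
import statistics


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def prompt_sensitivity(rates_by_prompt):
    """Sample standard deviation of a score across prompt phrasings."""
    return statistics.stdev(rates_by_prompt)


# Toy data (hypothetical): 1 = hallucinated, 0 = not, for 100 queries.
outcomes = [1] * 30 + [0] * 70
ci_low, ci_high = bootstrap_ci(outcomes)
print(f"rate=0.30, 95% CI=({ci_low:.2f}, {ci_high:.2f})")

# Toy per-object hallucination rates under the two evaluation regimes.
free_form_rates = [0.10, 0.25, 0.40, 0.05]
polling_rates = [0.12, 0.20, 0.35, 0.08]
r = pearson(free_form_rates, polling_rates)
print(f"pearson r={r:.3f}")

# Toy scores for one model under five prompt phrasings, two metrics.
polling_scores = [0.20, 0.21, 0.19, 0.20, 0.20]
other_metric_scores = [0.30, 0.45, 0.25, 0.50, 0.35]
print(prompt_sensitivity(polling_scores), prompt_sensitivity(other_metric_scores))
```

A lower standard deviation across phrasings would be the quantitative counterpart of the "stability" the referee asks the authors to demonstrate.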

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

The paper performs direct empirical evaluations of object hallucination on representative LVLMs using public datasets and existing methods, then proposes POPE as a polling-based alternative. No derivations, fitted parameters, or predictions reduce by construction to the paper's own inputs or self-citations. Claims about hallucination severity, frequency effects, and POPE's stability rest on experimental comparisons rather than self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that object hallucination can be reliably detected via object presence checks and that the tested models and instructions represent the general behavior of LVLMs.

axioms (1)

domain assumption: Object hallucination is defined as generating objects inconsistent with the target images in the descriptions.
This definition underpins all evaluation experiments and the design of POPE.
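Read operationally, this definition amounts to a presence check: an object counts as hallucinated when a description mentions it but the image annotations do not contain it. The sketch below is a simplified, CHAIR-style check under strong assumptions: a fixed object vocabulary, naive singular/plural word matching instead of synonym handling, and unique objects per caption rather than every mention. The captions and annotations are invented examples.

```python
import re


def mentioned_objects(caption, vocabulary):
    """Find vocabulary objects named in the caption (naive singular/plural match)."""
    text = caption.lower()
    return {
        obj for obj in vocabulary
        if re.search(rf"\b{re.escape(obj)}s?\b", text)
    }


def hallucination_rates(captions, ground_truth, vocabulary):
    """Share of mentioned objects, and of captions, that involve an absent object."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for caption, truth in zip(captions, ground_truth):
        mentioned = mentioned_objects(caption, vocabulary)
        hallucinated = mentioned - set(truth)
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += bool(hallucinated)
    return {
        "per_mention": hallucinated_mentions / total_mentions if total_mentions else 0.0,
        "per_caption": captions_with_hallucination / len(captions) if captions else 0.0,
    }


# Hypothetical captions and their annotated objects.
vocabulary = ["dog", "frisbee", "person", "car"]
captions = [
    "A dog catches a frisbee while a person watches.",
    "Two cars parked on a street.",
]
ground_truth = [{"dog", "frisbee"}, {"car"}]
rates = hallucination_rates(captions, ground_truth, vocabulary)
print(rates)
```

Here "person" is mentioned in the first caption but absent from its annotations, so one of four mentioned objects and one of two captions are flagged.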

pith-pipeline@v0.9.0 · 5546 in / 1208 out tokens · 52159 ms · 2026-05-11T13:38:09.947713+00:00 · methodology

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
cs.CV 2026-05 unverdicted novelty 7.0

CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
cs.CV 2026-05 unverdicted novelty 7.0

Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV 2026-04 conditional novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
Improving Vision-language Models with Perception-centric Process Reward Models
cs.CV 2026-04 unverdicted novelty 7.0

Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
cs.CV 2026-04 conditional novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
cs.CV 2026-04 unverdicted novelty 7.0

ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV 2026-05 unverdicted novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
Online Self-Calibration Against Hallucination in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
X2SAM: Any Segmentation in Images and Videos
cs.CV 2026-04 unverdicted novelty 6.0

X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
Mitigating Multimodal Hallucination via Phase-wise Self-reward
cs.CV 2026-04 unverdicted novelty 6.0

PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
cs.CV 2026-04 unverdicted novelty 6.0

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
cs.AI 2026-04 unverdicted novelty 6.0

EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

HaloProbe introduces a Bayesian method to estimate token-level hallucination probabilities in VLMs by factorizing external and internal signals, enabling more effective mitigation than intervention-based techniques wh...
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
cs.CV 2023-06 accept novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Perceptual Flow Network for Visually Grounded Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
cs.AI 2026-05 unverdicted novelty 5.0

A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
cs.CL 2026-04 conditional novelty 5.0

Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV 2026-04 unverdicted novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
cs.CV 2026-04 unverdicted novelty 5.0

VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
cs.CV 2026-04 unverdicted novelty 5.0

AutoVQA-G is a self-improving framework that generates VQA-G datasets with higher visual grounding accuracy than leading multimodal LLMs via iterative CoT verification and prompt refinement.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
cs.CV 2026-04 unverdicted novelty 5.0

DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
cs.CV 2026-04 unverdicted novelty 5.0

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
cs.AI 2024-03 unverdicted novelty 4.0

DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 43 Pith papers · 9 internal anchors

[1]

Harsh Agrawal, Peter Anderson, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. https://doi.org/10.1109/ICCV.2019.00904 nocaps: novel object captioning at scale . In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019 , pa...

work page doi:10.1109/iccv.2019.00904 2019
[2]

Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Kar \' e n Simonyan

Jean - Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bin...

work page 2022
[3]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 a . VQA: visual question answering. In ICCV , pages 2425--2433. IEEE Computer Society

work page 2015
[4]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015 b . https://doi.org/10.1109/ICCV.2015.279 VQA: visual question answering . In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , pages 2425--2433. IEEE Computer Society

work page doi:10.1109/iccv.2015.279 2015
[5]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR, abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

KevinGBecker,KathleenCBarnes,TiffaniJBright, and S Alex Wang

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. https://doi.org/10.48550/arXiv.2302.04023 A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity . CoRR, abs/2302.04023

work page doi:10.48550/arxiv.2302.04023 2023
[7]

Ali Furkan Biten, Llu \' s G \' o mez, and Dimosthenis Karatzas. 2022. https://doi.org/10.1109/WACV51458.2022.00253 Let there be a clock on the beach: Reducing object hallucination in image captioning . In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022 , pages 2473--2482. IEEE

work page doi:10.1109/wacv51458.2022.00253 2022
[8]

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023. Shikra: Unleashing multimodal llm's referential dialogue magic. CoRR, abs/2306.15195

[9]

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. https://lmsys.org/blog/2023-03-30-vicuna/ Vicuna: An open-source chatbot impressing gpt-4 with 90\

[10]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023a. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500

[11]

Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. 2023b. https://aclanthology.org/2023.eacl-main.156 Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-...

[12]

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. 2022. https://doi.org/10.1561/0600000105 Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends Comput. Graph. Vis., 14(3-4):163--352

[13]

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023. https://doi.org/10.48550/arXiv.2304.15010 Llama-adapter V2: parameter-efficient visual instruction model. CoRR, abs/2304.15010

[14]

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790

[15]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. https://doi.org/10.1109/CVPR.2017.670 Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325--6334....

[16]

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2015. Framing image description as a ranking task: Data, models and evaluation metrics (extended abstract). In IJCAI, pages 4188--4192. AAAI Press

[17]

Yi-Chong Huang, Xia-Chong Feng, Xiao-Cheng Feng, and Bing Qin. 2021. http://arxiv.org/abs/2104.14839 The factual inconsistency problem in abstractive text summarization: A survey. CoRR, abs/2104.14839

[18]

Drew A. Hudson and Christopher D. Manning. 2019. https://doi.org/10.1109/CVPR.2019.00686 GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6700--6709. Computer Vision Foundation / IEEE

[19]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. http://arxiv.org/abs/2202.03629 Survey of hallucination in natural language generation. CoRR, abs/2202.03629

[20]

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023a. https://doi.org/10.48550/arXiv.2305.03726 Otter: A multi-modal model with in-context instruction tuning. CoRR, abs/2305.03726

[21]

Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C. H. Hoi. 2022a. https://doi.org/10.48550/arXiv.2209.09019 LAVIS: A library for language-vision intelligence. CoRR, abs/2209.09019

[22]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023b. https://doi.org/10.48550/arXiv.2301.12597 BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. CoRR, abs/2301.12597

[23]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022b. https://proceedings.mlr.press/v162/li22n.html BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Mac...

[24]

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. https://doi.org/10.1007/978-3-030-58577-8_8 Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2...

[25]

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. https://doi.org/10.1007/978-3-319-10602-1_48 Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, vo...

[26]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.08485 Visual instruction tuning. CoRR, abs/2304.08485

[27]

Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. https://doi.org/10.1145/3025453.3025814 Understanding blind people's experiences with computer-generated captions of social media images. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, May 06-11, 2017, pages 5988--5999. ACM

[28]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report. CoRR, abs/2303.08774

[29]

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In NIPS, pages 1143--1151

[30]

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. https://doi.org/10.18653/v1/d18-1437 Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4035--4045. Association for Computational...

[31]

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. https://doi.org/10.1007/978-3-031-20074-8_9 A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VIII, volume 13668 of ...

[32]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL (1), pages 2556--2565. Association for Computational Linguistics

[33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971

[34]

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. https://proceedings.mlr.press/v162/wang22al.html OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, ICML 2022, 17-23 July ...

[35]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. 2023. https://doi.org/10.48550/arXiv.2304.14178 mplug-owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178

[36]

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. Yin and yang: Balancing and answering binary visual questions. In CVPR, pages 5014--5022. IEEE Computer Society

[37]

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. https://doi.org/10.1109/CVPR46437.2021.00553 Vinvl: Revisiting visual representations in vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 5579--5588. Computer ...

[38]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. https://doi.org/10.48550/arXiv.2303.18223 A survey of large langua...

[39]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. https://doi.org/10.48550/arXiv.2304.10592 Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR, abs/2304.10592

[40]

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. 2023. https://doi.org/10.48550/arXiv.2304.06718 Segment everything everywhere all at once. CoRR, abs/2304.06718
