Evaluating Object Hallucination in Large Vision-Language Models
Pith reviewed 2026-05-11 13:38 UTC · model grok-4.3
The pith
Large vision-language models often describe objects absent from the given image, especially those frequent in instructions or co-occurring with visible items.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large vision-language models suffer from severe object hallucination: they generate objects that are inconsistent with the target images. Objects that occur frequently in the visual instructions, or that co-occur with objects actually in the image, are especially prone to being hallucinated. Because existing evaluation methods can be affected by the input instructions and by the generation styles of LVLMs, the paper proposes POPE, a polling-based query method that evaluates object hallucination in a more stable and flexible way.
What carries the argument
POPE, a polling-based query method that asks the model yes/no questions about the presence of candidate objects in a fixed polling format to measure hallucination rates.
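The polling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the `parse_answer` rule, the `PollScores` container, and the choice of summary statistics (accuracy, precision, recall, F1, and the share of "yes" answers) are all introduced here for illustration. Each poll pairs a model's reply to a yes/no question about one candidate object with a ground-truth label saying whether that object is in the image.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PollScores:
    accuracy: float
    precision: float
    recall: float
    f1: float
    yes_ratio: float


def parse_answer(reply: str) -> bool:
    """Hypothetical parsing rule: a reply counts as "yes" if it starts with "yes"."""
    return reply.strip().lower().startswith("yes")


def score_poll(replies: Sequence[str], present: Sequence[bool]) -> PollScores:
    """Score yes/no replies against ground-truth object presence."""
    if len(replies) != len(present) or not replies:
        raise ValueError("replies and present must be non-empty and equal in length")
    preds = [parse_answer(r) for r in replies]
    tp = sum(1 for p, y in zip(preds, present) if p and y)
    # A "yes" for an absent object is a hallucinated object.
    fp = sum(1 for p, y in zip(preds, present) if p and not y)
    fn = sum(1 for p, y in zip(preds, present) if not p and y)
    tn = len(preds) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return PollScores(
        accuracy=(tp + tn) / len(preds),
        precision=precision,
        recall=recall,
        f1=f1,
        yes_ratio=sum(preds) / len(preds),
    )
```

In this sketch a `yes_ratio` far above the true share of present objects would point to a yes-bias rather than to object-specific hallucination, which is the concern the referee report raises about polling.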
If this is right
- Visual instructions should be designed to minimize exposure to frequent or co-occurring objects to lower hallucination rates.
- POPE allows consistent ranking of different LVLMs on hallucination without dependence on their particular generation styles.
- Models will continue to favor hallucinated objects that match patterns in their training instructions unless those patterns are altered.
- Improved evaluation reveals specific objects most likely to be invented, guiding targeted fixes in training data.
Where Pith is reading between the lines
- Hallucination may arise when language-model priors about common object co-occurrences override the actual visual signal.
- POPE could be extended to probe other hallucination types such as attributes or relations beyond objects.
- Widespread use of POPE on new models would let researchers track whether scaling or new training techniques actually reduce the problem.
- If polling reveals systematic over-generation of certain object classes, retraining with balanced negative examples might help.
Load-bearing premise
The selected representative LVLMs and visual instruction datasets are sufficiently typical of the broader class of models, and the polling queries in POPE do not introduce new systematic biases in measuring hallucination.
What would settle it
Run POPE and prior evaluation methods on the same set of model outputs, then compare both against human judgments of object presence in the images; if POPE scores remain stable while prior scores shift with instruction wording, the claim holds.
Original abstract
Inspired by the superior language abilities of large language models (LLM), large vision-language models (LVLM) have been recently explored by integrating powerful LLMs for improving the performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that LVLMs suffer from the hallucination problem, i.e. they tend to generate objects that are inconsistent with the target images in the descriptions. To investigate it, this work presents the first systematic study on object hallucination of LVLMs. We conduct the evaluation experiments on several representative LVLMs, and show that they mostly suffer from severe object hallucination issue. We further discuss that the visual instructions may influence the hallucination, and find that: objects that frequently occur in the visual instructions or co-occur with the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we find that existing evaluation methods might be affected by the input instructions and generation styles of LVLMs. Thus, we further design an improved evaluation method for object hallucination by proposing a polling-based query method called POPE. Experiment results demonstrate that our POPE can evaluate the object hallucination in a more stable and flexible way. Our codes and data are publicly available at https://github.com/RUCAIBox/POPE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts the first systematic study of object hallucination in large vision-language models (LVLMs). Experiments on representative LVLMs show severe hallucination, with objects frequent in visual instructions or co-occurring with image objects being especially prone to generation. Existing evaluation methods are critiqued for sensitivity to instructions and generation style, leading to the proposal of POPE, a polling-based yes/no query method claimed to offer more stable and flexible evaluation. Code and data are released publicly.
Significance. If the central empirical patterns and POPE evaluation hold, the work is significant for multimodal AI research: it quantifies a reliability issue in LVLMs that affects downstream tasks such as captioning and VQA, identifies actionable instruction-related biases, and supplies a practical polling protocol plus public resources for reproducible benchmarking. The explicit release of code and data is a clear strength that supports follow-on mitigation studies.
major comments (3)
- [§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.
- [§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.
- [§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.
minor comments (3)
- [Abstract] Abstract: Key quantitative findings (e.g., hallucination percentages per model) are omitted; adding one or two headline numbers would improve clarity.
- Figure captions and tables: Ensure all axes and legends explicitly label hallucination rate versus object frequency or co-occurrence to avoid ambiguity in interpreting the reported patterns.
- Notation: Define 'visual instructions' and 'co-occurrence' operationally in the main text on first use, as these terms are central to the claimed patterns.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the presentation of our empirical findings and the justification for POPE. We address each major comment below and commit to revisions that strengthen the quantitative rigor and validation of our claims without altering the core contributions.
Point-by-point responses
-
Referee: [§3] §3 (Experiments): The manuscript does not report exact sample sizes (number of images and queries per model/dataset), controls for prompt variation, or statistical tests (e.g., confidence intervals or significance tests) used to establish the 'severe' hallucination rates and frequency/co-occurrence patterns; without these, the quantitative claims cannot be fully verified.
Authors: We agree that explicit reporting of sample sizes, prompt controls, and uncertainty estimates is necessary for verifiability. The experiments used the full COCO val2014 set (approximately 40k images) for the primary frequency and co-occurrence analyses across models, with 5k-image subsets for efficiency in some LVLM evaluations and 1k random samples for instruction-variation ablations; all queries per image were fixed to the same template to control prompt variation. In the revision we will add a dedicated table listing exact image counts and query counts per model and dataset, state that a single fixed prompt template was used across all models for the main results, and report bootstrap 95% confidence intervals on the reported hallucination percentages and co-occurrence correlations to substantiate the severity claims. revision: yes
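A bootstrap interval of the kind promised here could be computed roughly as follows. This is a hedged sketch of a percentile bootstrap over per-image outcomes, not the authors' procedure: the function name `bootstrap_ci`, the resample count, and the fixed seed are assumptions, and each outcome is taken to be 1 if a hallucination was recorded for that image and 0 otherwise.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample images with replacement and record the hallucination rate.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling at the image level, rather than at the query level, keeps all queries about one image together; whether that is the right unit depends on how the experiments were actually run.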
-
Referee: [§4] §4 (POPE): The claim that POPE evaluates the same underlying hallucination phenomenon as the free-form generation experiments rests on an untested premise; no side-by-side correlation analysis or ablation is presented showing that yes/no polling rates reproduce the frequency and co-occurrence effects observed in open-ended outputs, raising the possibility that POPE instead measures query compliance or yes-bias.
Authors: The design of POPE deliberately decouples object hallucination measurement from open-ended generation style by using balanced yes/no queries, but we acknowledge that we did not present a direct quantitative link showing that the same frequency and co-occurrence biases appear under POPE. In the revised manuscript we will add a new analysis that computes per-object hallucination rates under both free-form captioning and POPE on the same set of images and models, reports Pearson correlations between the two, and includes an ablation that balances positive/negative object queries to quantify any yes-bias. This will either confirm that POPE reproduces the key patterns or allow us to clarify the precise relationship between the two evaluation regimes. revision: yes
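The correlation analysis described in this response can be illustrated with a plain Pearson coefficient over per-object rates. The sketch below assumes two equally long lists, one rate per object class under free-form captioning and one under polling; the function name `pearson` and the input layout are assumptions made for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two paired sequences of rates."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near 1 would suggest that polling and free-form generation rank objects similarly; a low value would leave open the referee's worry that polling measures something else, such as query compliance.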
-
Referee: [§4.2] §4.2 (Comparison to prior methods): The superiority of POPE over existing metrics is asserted via stability and flexibility, yet the paper provides no quantitative metric (e.g., variance across prompt styles or inter-rater agreement) demonstrating reduced sensitivity; this is load-bearing for the central methodological contribution.
Authors: We presented qualitative evidence that prior metrics vary with instruction phrasing and generation length while POPE remains consistent, but we did not supply a quantitative stability metric such as variance across prompt variants. We will revise §4.2 to include a controlled ablation that applies five distinct prompt phrasings to the same images and models, computes the standard deviation of hallucination rates for POPE versus CHAIR and other baselines, and reports these variance numbers together with the original stability claims. This will provide the requested quantitative support for reduced sensitivity. revision: yes
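The stability comparison proposed here reduces to a spread statistic per metric. As a rough sketch, assuming one hallucination rate per prompt phrasing for a given model and metric, the sample standard deviation can serve as the sensitivity score; the function name `prompt_sensitivity` and the dictionary layout are hypothetical.

```python
from statistics import stdev
from typing import Mapping


def prompt_sensitivity(rates_by_prompt: Mapping[str, float]) -> float:
    """Sample standard deviation of a metric's rate across prompt phrasings."""
    rates = list(rates_by_prompt.values())
    if len(rates) < 2:
        raise ValueError("need rates for at least two prompt phrasings")
    return stdev(rates)
```

Under this reading, a metric is less prompt-sensitive than another when its score is lower on the same images and models; with only five phrasings the estimate is coarse, so the numbers would be best reported alongside the raw per-prompt rates.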
Circularity Check
No significant circularity; empirical evaluation is self-contained
Full rationale
The paper performs direct empirical evaluations of object hallucination on representative LVLMs using public datasets and existing methods, then proposes POPE as a polling-based alternative. No derivations, fitted parameters, or predictions reduce by construction to the paper's own inputs or self-citations. Claims about hallucination severity, frequency effects, and POPE's stability rest on experimental comparisons rather than self-referential definitions or load-bearing self-citations. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: object hallucination is defined as generating objects in the descriptions that are inconsistent with the target images.
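Read operationally, this definition amounts to a set difference between the objects a description mentions and the objects annotated in the image. The sketch below is an assumption-laden illustration, loosely in the spirit of caption-based metrics such as CHAIR (named as a baseline in the rebuttal): it assumes object mentions have already been extracted from the text and normalizes only by lowercasing. Both function names are hypothetical.

```python
from typing import Iterable, Set


def hallucinated_objects(mentioned: Iterable[str], present: Iterable[str]) -> Set[str]:
    """Objects named in the description that are not annotated in the image."""
    return {m.lower() for m in mentioned} - {p.lower() for p in present}


def hallucination_rate(mentioned: Iterable[str], present: Iterable[str]) -> float:
    """Share of distinct mentioned objects that are hallucinated."""
    mentioned_set = {m.lower() for m in mentioned}
    if not mentioned_set:
        return 0.0  # nothing mentioned, so nothing hallucinated
    present_set = {p.lower() for p in present}
    return len(mentioned_set - present_set) / len(mentioned_set)
```

Synonym handling and mention extraction are left out; in practice those steps are where such a metric becomes sensitive to generation style.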
Forward citations
Cited by 43 Pith papers
-
UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
-
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
-
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
-
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
-
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
-
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
-
Online Self-Calibration Against Hallucination in Vision-Language Models
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
-
X2SAM: Any Segmentation in Images and Videos
X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
-
Mitigating Multimodal Hallucination via Phase-wise Self-reward
PSRD mitigates visual hallucinations in LVLMs via phase-wise self-reward decoding, cutting rates by 50% on LLaVA-1.5-7B and outperforming prior methods on five benchmarks.
-
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
-
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...
-
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
-
From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning
EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
-
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
-
HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
HaloProbe introduces a Bayesian method to estimate token-level hallucination probabilities in VLMs by factorizing external and internal signals, enabling more effective mitigation than intervention-based techniques wh...
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
-
VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
VCE mitigates object hallucination in LVLMs by decomposing activation patterns from contrastive visual inputs via SVD to suppress hallucination subspaces through targeted parameter edits.
-
AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
AutoVQA-G is a self-improving framework that generates VQA-G datasets with higher visual grounding accuracy than leading multimodal LLMs via iterative CoT verification and prompt refinement.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
-
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.