Improved Baselines with Visual Instruction Tuning
Pith reviewed 2026-05-12 19:05 UTC · model grok-4.3
The pith
Simple modifications to LLaVA produce stronger baselines that lead on 11 visual instruction benchmarks using only 1.2 million public examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The fully-connected vision-language cross-modal connector inside LLaVA is surprisingly powerful and data-efficient. Three straightforward changes carry the result: CLIP-ViT-L-336px as the vision encoder, an MLP projection layer as the connector, and added academic-task-oriented VQA data formatted with simple response prompts. With these changes, the resulting models reach state-of-the-art results on eleven benchmarks. The final 13B checkpoint trains on just 1.2 million publicly available examples and completes full training in about one day on a single 8-A100 node.
What carries the argument
The fully-connected vision-language cross-modal connector, implemented as an MLP projection layer that maps visual features from the CLIP encoder into the language model's embedding space.
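A minimal sketch of such a connector, in plain Python, may help make the mapping concrete. It is not the authors' implementation: the two-layer depth, the GELU activation, the toy dimensions, and every name below (`mlp_project`, `VISION_DIM`, `LLM_DIM`) are assumptions chosen for illustration, since this page does not specify the projector's layer count or hidden sizes.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation (an assumed choice)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def make_linear(rng: random.Random, d_in: int, d_out: int):
    """Create a randomly initialised (weight, bias) pair for a d_in -> d_out layer."""
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(x, layer):
    """Apply one affine layer to a single feature vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_project(patch_features, fc1, fc2):
    """Map each visual patch feature into the language model's embedding space."""
    tokens = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, fc1)]
        tokens.append(linear(hidden, fc2))
    return tokens


# Toy sizes for illustration only; real encoders and LLMs are far wider.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(0)
fc1 = make_linear(rng, VISION_DIM, LLM_DIM)
fc2 = make_linear(rng, LLM_DIM, LLM_DIM)

# Stand-in for the per-patch features a vision encoder would produce.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
visual_tokens = mlp_project(patches, fc1, fc2)

print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

Each patch is projected independently, so the number of visual tokens equals the number of patches and only the feature width changes.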
Load-bearing premise
The reported gains are caused by the listed changes to the vision encoder, connector, and training data rather than by differences in training procedure, data cleaning steps, or evaluation details that are not described.
What would settle it
Retraining the original LLaVA architecture with the same new data mixture and formatting prompts but keeping the smaller CLIP encoder and linear connector, then measuring whether the performance gap largely disappears.
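The control run described above can be sketched as one cell in a small ablation grid. The factor names and value labels below are hypothetical placeholders, not identifiers from the paper or its code; the sketch only shows which factors differ between the control run and the full model.

```python
from itertools import product

# Hypothetical labels for the three factors named in the text.
FACTORS = {
    "vision_encoder": ("original", "CLIP-ViT-L-336px"),
    "connector": ("linear", "mlp"),
    "academic_vqa_data": (False, True),
}


def ablation_grid(factors):
    """Enumerate every combination of factor values as a config dict."""
    names = list(factors)
    return [dict(zip(names, values)) for values in product(*(factors[n] for n in names))]


def differing_factors(a, b):
    """Return the factor names on which two configs disagree."""
    return sorted(k for k in a if a[k] != b[k])


grid = ablation_grid(FACTORS)

# Control run: new data mixture, but original encoder and linear connector.
control = {"vision_encoder": "original", "connector": "linear", "academic_vqa_data": True}
# Full model: all three changes applied.
full = {"vision_encoder": "CLIP-ViT-L-336px", "connector": "mlp", "academic_vqa_data": True}

print(len(grid))                         # 8
print(differing_factors(control, full))  # ['connector', 'vision_encoder']
```

If the control and the full model score similarly, the data mixture would account for most of the gain; a large remaining gap would point to the encoder and connector.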
Original abstract
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
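The "simple response formatting prompts" mentioned in the abstract can be illustrated with a short sketch. The prompt wording, the function name `format_vqa_sample`, and the dictionary layout are assumptions for illustration; this page does not quote the exact prompt or data format used in training.

```python
# Assumed wording of a short-answer formatting prompt (not quoted on this page).
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


def format_vqa_sample(question: str, answer: str, short_answer: bool = True) -> dict:
    """Build one instruction-tuning example from a VQA question/answer pair.

    When short_answer is True, the formatting prompt is appended so the model
    is told explicitly what output style is expected.
    """
    prompt = question.strip()
    if short_answer:
        prompt = f"{prompt}\n{SHORT_ANSWER_PROMPT}"
    return {"prompt": prompt, "response": answer.strip()}


sample = format_vqa_sample("What color is the bus?", "Red")
print(sample["prompt"])
print(sample["response"])
```

Stating the expected answer format in the prompt is meant to let short-answer VQA data sit alongside conversational data without pushing the model toward terse replies everywhere.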
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that simple modifications to LLaVA—specifically, replacing the vision-language connector with an MLP and using the higher-resolution CLIP-ViT-L-336px encoder, plus supplementing training with academic VQA data under simple response formatting prompts—produce stronger baselines that achieve state-of-the-art results across 11 benchmarks. The final 13B model is trained on only 1.2M public examples and completes full training in roughly one day on a single 8-A100 node.
Significance. If the reported gains hold under controlled conditions, the work is significant for demonstrating that competitive LMM performance is achievable with modest public data and compute, thereby lowering barriers to entry. The public release of code and models is a clear strength that supports reproducibility and future baseline comparisons in visual instruction tuning.
major comments (1)
- [Experiments] The central attribution—that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results—requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only these two changes, leaving open the possibility that unstated factors contribute to the lift.
minor comments (2)
- A consolidated table listing all 11 benchmarks with exact metrics for the proposed model versus prior baselines would improve clarity and allow direct verification of the SOTA claim.
- [Method] The MLP projection architecture (layer count, hidden dimensions) and the precise composition of the 1.2M training mixture should be specified in the method section for full reproducibility.
Simulated Author's Rebuttal
We thank the referee for their careful review and for recognizing the significance of our work in making strong LMM baselines more accessible. We provide a point-by-point response to the major comment below.
-
Referee: [Experiments] The central attribution—that the listed modifications (CLIP-ViT-L-336px + MLP connector and added academic VQA data) are responsible for the SOTA results—requires isolating ablations. The experimental comparisons do not hold all other variables (optimizer schedule, data filtering, response formatting details beyond the stated prompts, or evaluation protocol) fixed while toggling only these two changes, leaving open the possibility that unstated factors contribute to the lift.
Authors: We agree with the referee that isolating the effects of each modification through controlled ablations would provide stronger evidence for our claims. The original manuscript presents comparisons of the full improved model against prior work, but does not include exhaustive ablations that hold all other factors constant. In the revised version, we will add new experiments that fix the training recipe (optimizer, schedule, data filtering, prompts, and evaluation) and vary only the vision encoder resolution, the connector architecture, and the addition of academic VQA data. These ablations will clarify the contribution of each change.
revision: yes
Circularity Check
No circularity: empirical baseline improvements rest on training runs and benchmarks
The paper reports results from training LLaVA variants with stated modifications (CLIP-ViT-L-336px + MLP connector, added academic VQA data, simple prompts) and evaluates them on 11 public benchmarks. No derivation chain, equations, or first-principles predictions exist that could reduce to self-defined quantities or fitted inputs by construction. Self-citations to prior LLaVA work supply background but do not serve as load-bearing uniqueness theorems or ansatzes; the central claims are new empirical numbers from explicit training procedures using 1.2M public data points. The work is self-contained against external benchmarks and does not rename known patterns or smuggle assumptions via citation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard machine learning assumptions on data representativeness and model generalization to benchmarks
Forward citations
Cited by 39 Pith papers
-
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
A Sanity Check on Composed Image Retrieval
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Revealing Interpretable Failure Modes of VLMs
REVELIO uncovers interpretable failure modes in VLMs by searching combinatorial concept spaces with diversity-aware beam search and Gaussian-process Thompson sampling, revealing vulnerabilities in autonomous driving a...
-
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
-
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
-
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
PhysNote lets VLMs externalize physical knowledge into hierarchical self-generated notes, stabilizing spatio-temporal reasoning and yielding 56.68% accuracy on PhysBench with a 4.96% gain over the best multi-agent baseline.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals...
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Reinforcement learning post-training enables generalization to unseen textual rule variants and visual changes in foundation models, while supervised fine-tuning primarily leads to memorization.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
-
CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering
CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
-
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.
-
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
-
Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Fuyu-8b: A multimodal architecture for ai agents
Adept AI. Fuyu-8b: A multimodal architecture for ai agents. https://www.adept.ai/blog/fuyu-8b, 2024. 2
work page 2024
-
[2]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a vi- sual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 2, 4, 5, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wan- rong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schimdt. Visit-bench: A benchmark for vision- language instruction following inspired by real-world use,
-
[5]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023
Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, et al. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023. 1
-
[7]
Visual instruction tuning with polite flamingo
Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. Visual instruction tuning with polite flamingo. arXiv preprint arXiv:2307.01003, 2023. 2, 3
-
[8]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geof- frey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 3
work page 2020
-
[10]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 3
work page internal anchor Pith review arXiv 2003
-
[11]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebas- tian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 5
-
[12]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 9
work page 2023
-
[13]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction- finetuned language models. arXiv preprint arXiv:2210.11416,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 2, 3, 5, 6, 8, 13
work page internal anchor Pith review arXiv 2023
-
[15]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 4
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[16]
Eva: Exploring the limits of masked visual representation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358– 19369, 2023. 6
work page 2023
-
[17]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 1, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
arXiv preprint arXiv:2305.04790 , year=
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790,
-
[19]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3, 5, 6, 9
work page 2017
-
[20]
Vizwiz grand challenge: Answering visual questions from blind people
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,
-
[21]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 4, 5, 9
work page 2019
-
[22]
Introducing idefics: An open reproduction of state-of-the-art visual language model
IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https : / / huggingface.co/blog/idefics, 2023. 5, 6
work page 2023
-
[23]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Ha- jishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021. If you use this software, please cite it as below. 6
work page 2021
-
[24]
Referitgame: Referring to objects in pho- tographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 4, 9
work page 2014
-
[25]
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017. 4, 9
work page 2017
-
[26]
Lisa: Reasoning segmentation via large language model,
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692,
-
[27]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 1, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Otterhd: A high-resolution multi-modality model, 2023
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi-modality model, 2023. 2
work page 2023
-
[29]
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 1
-
[30]
Multimodal founda- tion models: From specialists to general-purpose assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Lin- jie Li, Lijuan Wang, and Jianfeng Gao. Multimodal founda- tion models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023. 1
-
[31]
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and- vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 1
-
[32]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 3, 4, 5, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz- ing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[34]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 1, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 2, 5
work page 2014
-
[36]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 2, 3, 4, 5, 6, 8, 9
work page 2023
-
[37]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 1, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 5
work page 2022
-
[39]
An empirical study of scal- ing instruct-tuned large multimodal models
Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jian- feng Gao, and Yelong Shen. An empirical study of scal- ing instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023. 1, 4
-
[40]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Cam- buru, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Pro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016. 4, 9
work page 2016
-
[41]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4, 9
work page 2019
-
[42]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019. 4, 9
work page 2019
-
[43]
OpenAI. Gpt-4v(ision) system card. https://cdn. openai . com / papers / GPTV _ System _ Card . pdf,
-
[44]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[45]
A-okvqa: A bench- mark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A bench- mark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4, 9
work page 2022
- [46]
-
[47]
Textcaps: a dataset for image captioning with reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Aman- preet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4, 9
work page 2020
-
[48]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xin- lei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 8317–8326, 2019. 5
-
[49]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 7
-
[50]
Visionllm: Large language model is also an open-ended decoder for vision-centric tasks
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023. 1
-
[51]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021. 2
-
[52]
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023. 6
-
[53]
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, and Fei Huang. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model, 2023. 2
-
[54]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 1
-
[55]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 1, 3, 4, 5, 8
-
[56]
Gpt4roi: Instruction tuning large language model on region-of-interest
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023. 1
-
[57]
Llavar: Enhanced visual instruction tuning for text-rich image understanding
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023. 2
-
[58]
Svit: Scaling up visual instruction tuning
Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,
-
[59]
On evaluating adversarial robustness of large vision-language models
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023. 1
-
[60]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 10
-
[61]
Lima: Less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023. 2, 8
-
[62]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2