pith. machine review for the scientific record.

arxiv: 2403.20330 · v2 · submitted 2024-03-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Are We on the Right Way for Evaluating Large Vision-Language Models?

Dahua Lin, Feng Zhao, Haodong Duan, Jiaqi Wang, Jinsong Li, Lin Chen, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yu Qiao, Zehui Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 19:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords large vision-language models · LVLM evaluation · data leakage · multi-modal benchmarks · visual dependency · MMStar · performance overestimation · benchmark curation

The pith

Many current benchmarks let vision-language models answer correctly without using the images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing evaluation sets for large vision-language models contain many questions whose answers can be reached from the text of the question and options alone or from knowledge already present in the underlying language model. This leads to performance numbers that do not reflect genuine multi-modal improvement. The authors therefore built MMStar, a 1,500-sample benchmark in which every item was filtered first by automation and then by human review to guarantee visual necessity, low data leakage, and coverage of six core capabilities.

Core claim

Current LVLM benchmarks suffer from two problems. First, visual content is unnecessary for many samples: answers can be inferred from the questions, the options, or the world knowledge embedded in the LLM backbone. Second, unintentional data leakage in training data lets models answer some genuinely visual questions without images. For example, GeminiPro scores 42.9 percent on MMMU with no visual input, and Sphinx-X-MoE scores 43.6 percent on the same set, 17.9 points above its text-only LLM backbone. Both issues cause misjudgment of actual multi-modal gains. MMStar addresses them with 1,500 human-curated samples plus two new metrics that separately quantify data leakage and the true performance gain from multi-modal training.
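A minimal formalization consistent with these figures (the paper defines its two metrics precisely; the function names and the zero floor here are illustrative assumptions, and the backbone score is derived from the abstract's 17.9-point margin):

```python
def multimodal_leakage(acc_lvlm_no_image, acc_llm_no_image):
    """Leakage proxy: how far the LVLM's image-free accuracy sits above
    its own LLM backbone's image-free accuracy. A positive value suggests
    test items were memorized during multi-modal training rather than
    solved from the image. The zero floor is an assumption."""
    return max(0.0, acc_lvlm_no_image - acc_llm_no_image)

def multimodal_gain(acc_with_image, acc_no_image):
    """Gain proxy: the same model's accuracy with images minus its
    accuracy without them; what the vision pathway actually adds."""
    return acc_with_image - acc_no_image

# The abstract's Sphinx-X-MoE example: 43.6% on MMMU without images,
# 17.9 points above its LLM backbone, implying a backbone score of 25.7%.
print(round(multimodal_leakage(43.6, 25.7), 1))  # prints 17.9
```

A benchmark with low leakage and high gain is the design target MMStar aims for.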

What carries the argument

MMStar, a benchmark of 1,500 samples obtained by automated pre-filtering of existing datasets followed by human review to enforce visual dependency, minimal leakage, and advanced multi-modal requirements.
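In outline, the two-stage curation could be sketched as follows; the gateway-model idea, field names, and threshold are assumptions rather than the paper's exact pipeline, and human review remains the final gate:

```python
def automated_prefilter(samples, gateway_answers, max_solvers=0):
    """Stage 1 (automated): keep only candidates that at most
    `max_solvers` of the text-only "gateway" models answer correctly
    without seeing the image. Survivors go to Stage 2 (human review)
    to confirm visual dependency, minimal leakage, and advanced
    multi-modal requirements."""
    survivors = []
    for s in samples:
        solved = sum(
            1 for answers in gateway_answers
            if answers.get(s["id"]) == s["answer"]
        )
        if solved <= max_solvers:
            survivors.append(s)
    return survivors

# Two candidate items, two gateway models queried without images.
samples = [{"id": "q1", "answer": "B"}, {"id": "q2", "answer": "D"}]
gateway_answers = [{"q1": "B", "q2": "A"}, {"q1": "B", "q2": "C"}]
# q1 is text-solvable by both models and is dropped; q2 survives.
print([s["id"] for s in automated_prefilter(samples, gateway_answers)])
```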

If this is right

  • Scores on existing benchmarks such as MMMU systematically overestimate true multi-modal capability.
  • Models may improve on leaderboards by exploiting text shortcuts rather than learning to integrate vision and language.
  • Reported gains from multi-modal training are unreliable until leakage is measured and subtracted.
  • The two proposed metrics allow future work to distinguish memorization effects from genuine cross-modal learning.
  • Benchmark design must now prioritize explicit checks for visual necessity to avoid guiding research toward text-only solutions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark creators in other modalities could adopt the same automated-plus-human pipeline to reduce text-only solvability.
  • Widespread adoption of MMStar-style sets would likely produce a temporary slowdown in reported progress until models improve their actual vision components.
  • The leakage findings imply that large-scale training corpora need systematic deduplication against future test sets.
  • Individual model developers could use the leakage metric to audit how much of their performance comes from memorization of public benchmarks.

Load-bearing premise

Human reviewers can reliably identify samples that genuinely require images and contain no leakage, without their own world knowledge or inconsistent judgments biasing the selection.

What would settle it

If top LVLMs achieve nearly the same accuracy on MMStar with images removed as they do with images present, the claim that the new benchmark successfully isolates vision-dependent tasks would be falsified.
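That settling experiment is a paired comparison: run the same model on the same items with and without images and inspect the gap. A toy sketch (all names and the `epsilon` threshold are hypothetical; `model` stands in for any LVLM callable):

```python
def accuracy(samples, predict):
    """Fraction of samples the predictor answers correctly."""
    return sum(predict(s) == s["answer"] for s in samples) / len(samples)

def vision_dependency_gap(samples, model, epsilon=0.05):
    """Evaluate the same model with and without images. A gap near
    zero would falsify the claim that the benchmark isolates
    vision-dependent tasks; return (gap, claim_survives)."""
    with_img = accuracy(samples, lambda s: model(s["question"], s["options"], s["image"]))
    no_img = accuracy(samples, lambda s: model(s["question"], s["options"], None))
    return with_img - no_img, (with_img - no_img) > epsilon

# Toy model: answers correctly only when it can "see" the image.
def toy_model(question, options, image):
    return image if image is not None else options[0]

samples = [
    {"question": "q1", "options": ["A", "B"], "answer": "B", "image": "B"},
    {"question": "q2", "options": ["C", "D"], "answer": "D", "image": "D"},
]
gap, survives = vision_dependency_gap(samples, toy_model)
print(gap, survives)  # full 1.0 gap: this toy set is vision-dependent
```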

Original abstract

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies two issues in existing LVLM benchmarks: (1) many samples do not require visual input, as answers can be inferred from questions, options, or LLM world knowledge (e.g., GeminiPro achieves 42.9% on MMMU without images), and (2) unintentional data leakage from training data allows models to answer visual-necessary questions without images (e.g., Sphinx-X-MoE at 43.6% on MMMU without images, exceeding its LLM backbone). To mitigate misjudgment of multi-modal gains, the authors introduce MMStar, a 1,500-sample benchmark covering 6 core capabilities and 18 axes, curated via an automated pipeline from existing benchmarks followed by human review to enforce visual dependency, minimal leakage, and advanced multi-modal requirements. They also propose two metrics to quantify leakage and actual multi-modal performance gains, and evaluate 16 leading LVLMs on MMStar and 7 other benchmarks.

Significance. If the curation successfully isolates vision-indispensable samples with negligible leakage, MMStar would offer a more accurate benchmark for true multi-modal capabilities than current ones, helping to better measure progress and avoid misguided research directions. The leakage and gain metrics provide a concrete, behavior-based way to diagnose benchmark contamination. The empirical evaluation across models supplies useful comparative data on current LVLM limitations.

major comments (2)
  1. [MMStar construction] In the MMStar construction section: the human review step lacks any reported inter-annotator agreement, explicit decision criteria for 'visual dependency' and 'minimal data leakage', or a described protocol for leakage detection (e.g., systematic zero-image testing on candidates). Because the central claim that MMStar corrects benchmark mismeasurement rests entirely on the purity of these 1,500 samples, the absence of reproducibility details for the filter is a load-bearing gap.
  2. [Metrics and evaluation] In the metrics and evaluation sections: the exact operational definitions of the two proposed metrics (leakage measured on held-out visual-absent inputs, and multi-modal gain) are not fully formalized, including how candidates are held out and how gains are normalized against LLM backbones. This makes it difficult to verify that the metrics avoid circularity with the same models used in filtering.
minor comments (2)
  1. Tables reporting model scores on MMStar and other benchmarks would benefit from explicit mention of whether results are averaged over multiple runs or seeds, and inclusion of standard deviations.
  2. The automated pipeline for initial sample selection is referenced but its precise filtering rules (e.g., thresholds for text-only solvability) are not enumerated, which would aid reproducibility even if human review is the final gate.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, which highlights important aspects of reproducibility and metric formalization. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [MMStar construction] In the MMStar construction section: the human review step lacks any reported inter-annotator agreement, explicit decision criteria for 'visual dependency' and 'minimal data leakage', or a described protocol for leakage detection (e.g., systematic zero-image testing on candidates). Because the central claim that MMStar corrects benchmark mismeasurement rests entirely on the purity of these 1,500 samples, the absence of reproducibility details for the filter is a load-bearing gap.

    Authors: We agree that the human review process requires more explicit documentation to support reproducibility claims. In the revised manuscript, we will add a dedicated subsection detailing: (i) the annotation guidelines and explicit decision criteria for visual dependency (samples where correct answers require image content, verified by human judgment that text-only versions yield near-random performance); (ii) criteria for minimal data leakage (samples where candidate LVLMs achieve performance statistically indistinguishable from random guessing without images); (iii) the leakage detection protocol, which includes systematic zero-image testing on all candidates using multiple models; and (iv) inter-annotator agreement results (e.g., Cohen's kappa and raw agreement rates) computed over a sampled subset reviewed by three independent annotators. These additions will directly address the load-bearing nature of the curation purity. revision: yes

  2. Referee: [Metrics and evaluation] In the metrics and evaluation sections: the exact operational definitions of the two proposed metrics (leakage measured on held-out visual-absent inputs, and multi-modal gain) are not fully formalized, including how candidates are held out and how gains are normalized against LLM backbones. This makes it difficult to verify that the metrics avoid circularity with the same models used in filtering.

    Authors: We acknowledge the need for precise formalization to eliminate any ambiguity around circularity. In the revision, we will introduce mathematical definitions and pseudocode: the leakage metric is defined as Acc_LVLM(text-only) - Acc_random on the final MMStar set; the multi-modal gain metric is [Acc_LVLM(with-image) - Acc_LVLM(text-only)] normalized by subtracting the corresponding LLM backbone's text-only accuracy. We will explicitly state that the initial automated filtering used a disjoint preliminary model set, while the reported metrics are computed on the held-out evaluation of 16 LVLMs after curation is complete, ensuring no overlap in model outputs between filtering and metric calculation. This separation will be documented with a clear pipeline diagram. revision: yes
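The inter-annotator agreement statistic promised in response 1 is straightforward to compute; a generic Cohen's kappa sketch, not the authors' code:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same
    items (e.g., keep/drop verdicts on candidate benchmark samples)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pa, pb = Counter(labels_a), Counter(labels_b)
    # Expected agreement under independent labeling with each
    # annotator's own marginal label frequencies.
    expected = sum((pa[c] / n) * (pb[c] / n) for c in pa.keys() | pb.keys())
    return (observed - expected) / (1 - expected)

# Two reviewers' keep/drop verdicts on six candidate samples:
a = ["keep", "keep", "drop", "keep", "drop", "drop"]
b = ["keep", "drop", "drop", "keep", "drop", "keep"]
print(round(cohens_kappa(a, b), 3))  # 4/6 raw agreement -> kappa 0.333
```

Raw agreement alone overstates reliability when one label dominates, which is why the rebuttal proposes reporting both.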

Circularity Check

0 steps flagged

No significant circularity; empirical measurements and benchmark construction are self-contained

Full rationale

The paper advances no mathematical derivation chain or first-principles predictions. Its central claims rest on direct empirical observations (model accuracy without images on existing benchmarks) and the construction of MMStar through an automated pre-filter plus human review to enforce visual dependency and low leakage. The two new metrics are defined from observable model behavior on held-out visual-absent inputs rather than fitted to the target result. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the core argument; the human-review filter is an external selection step, not a definitional reduction. The argument therefore does not presuppose its own conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that human judgment after automated filtering reliably isolates vision-indispensable items and that the leakage metric accurately reflects memorization rather than other factors.

axioms (1)
  • domain assumption: Human reviewers can reliably detect visual dependency and data leakage in benchmark samples
    The entire curation process rests on this judgment step after the automated pipeline.

pith-pipeline@v0.9.0 · 5670 in / 1397 out tokens · 29694 ms · 2026-05-12T19:36:18.644211+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  2. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  3. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  5. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  6. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  7. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  8. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  9. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  10. Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

    cs.AI 2026-05 unverdicted novelty 6.0

    Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.

  11. MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    cs.MM 2026-05 unverdicted novelty 6.0

    MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

  12. Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

  13. Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

  14. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  15. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  16. RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

    cs.LG 2026-04 unverdicted novelty 6.0

    RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...

  17. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  18. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  19. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  20. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  21. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  22. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  23. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  24. CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.

  25. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  26. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  27. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  28. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  29. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  30. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  31. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  32. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 29 Pith papers · 21 internal anchors

  1. [1]

    J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  2. [2]

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024

  4. [4]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- try, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  5. [5]

    L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023

  6. [6]

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, Z. Muyan, Q. Zhang, X. Zhu, L. Lu, et al. In- ternvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023

  7. [7]

    Cheng, Z

    S. Cheng, Z. Guo, J. Wu, K. Fang, P. Li, H. Liu, and Y . Liu. Can vision-language models think from a first-person perspective? arXiv preprint arXiv:2311.15596, 2023

  8. [8]

    Chiang, Z

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023

  9. [9]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  10. [10]

    Contributors

    O. Contributors. Opencompass: A universal evaluation platform for foundation models. https:// github.com/open-compass/opencompass, 2023

  11. [11]

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  12. [12]

    X. Dong, P. Zhang, Y . Zang, Y . Cao, B. Wang, L. Ouyang, X. Wei, S. Zhang, H. Duan, M. Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision- language large model. arXiv preprint arXiv:2401.16420, 2024

  13. [13]

    Z. Du, Y . Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021

  14. [14]

    C. Fu, P. Chen, Y . Shen, Y . Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and R. Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  15. [15]

    P. Gao, R. Zhang, C. Liu, L. Qiu, S. Huang, W. Lin, S. Zhao, S. Geng, Z. Lin, P. Jin, et al. Sphinx- x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024

  16. [16]

    Goyal, T

    Y . Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017

  17. [17]

    C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  18. [18]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  19. [19]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  20. [20]

    A diagram is worth a dozen images.ArXiv, abs/1603.07396,

    A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016. 18

  21. [21]

    B. Li, R. Wang, G. Wang, Y . Ge, Y . Ge, and Y . Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023

  22. [22]

    J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  23. [23]

    Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y . Sun, Y . Liu, and X. Bai. Monkey: Image resolution and text label are important things for large multi-modal models.arXiv preprint arXiv:2311.06607, 2023

  24. [24]

    H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  25. [25]

    H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  26. [26]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  27. [27]

    Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  28. [28]

    H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, Y . Sun, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024

  29. [29]

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  30. [30]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  31. [31]

    G. Luo, Y . Zhou, T. Ren, S. Chen, X. Sun, and R. Ji. Cheap and quick: Efficient vision-language instruc- tion tuning for large language models. arXiv preprint arXiv:2305.15023, 2023

  32. [32]

    Phi2: The surprising power of small language models

    Microsoft. Phi2: The surprising power of small language models. https://www.microsoft.com/ en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ , 2023

  33. [33]

    Nous-hermes-2-yi-34b

    NousResearch. Nous-hermes-2-yi-34b. https://huggingface.co/NousResearch/ Nous-Hermes-2-Yi-34B , 2023

  34. [34]

    OpenAI. Chatgpt. https://chat.openai.com/, 2023

  35. [35]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023

  36. [36]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  37. [37]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  38. [38]

    D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–

  39. [39]

    P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018

  40. [40]

    H. Taud and J.-F. Mas. Multilayer perceptron (mlp). Geomatic approaches for modeling land change scenarios, pages 451–455, 2018

  41. [41]

    G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  42. [42]

    I. Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023

  43. [43]

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  44. [44]

    J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y.-G. Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023

  45. [45]

    W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023

  46. [46]

    H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, C. Li, W. Sun, Q. Yan, G. Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023

  47. [47]

    A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, F. Yang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023

  48. [48]

    Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  49. [49]

    A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024

  50. [50]

    W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

  51. [51]

    X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023

  52. [52]

    P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu, L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan, H. Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023

  53. [53]

    B. Zhou, Y. Hu, X. Weng, J. Jia, J. Luo, X. Liu, J. Wu, and L. Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024

  54. [54]

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023