Are We on the Right Way for Evaluating Large Vision-Language Models?
Pith reviewed 2026-05-12 19:36 UTC · model grok-4.3
The pith
Many current benchmarks let vision-language models answer correctly without using the images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current LVLM benchmarks suffer from two problems: first, visual content is unnecessary for many samples, because the answers can be inferred from the questions, the options, or the LLM's world knowledge; second, unintentional data leakage in training data lets models answer some genuinely visual questions without images. For example, GeminiPro scores 42.9% on MMMU with no visual input, and Sphinx-X-MoE scores 43.6% on the same set without images. Both issues cause misjudgment of actual multi-modal gains. MMStar addresses them with 1,500 human-curated samples plus two new metrics that separately quantify data leakage and the true performance gain from multi-modal training.
What carries the argument
MMStar, a benchmark of 1,500 samples obtained by automated pre-filtering of existing datasets followed by human review to enforce visual dependency, minimal leakage, and advanced multi-modal requirements.
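To make the two-stage construction concrete, here is a minimal sketch of what the automated pre-filtering stage could look like, assuming a multiple-choice format: the Sample fields, the pass criterion (drop any sample that some text-only model already answers correctly), and all names are illustrative assumptions, not the paper's documented rules.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    question: str
    options: List[str]
    answer: str       # e.g. "B"
    image_path: str

# An "LLM" here is just a text-only answerer: (question, options) -> predicted letter.
TextOnlyLLM = Callable[[str, List[str]], str]

def text_only_solvable(sample: Sample, llms: List[TextOnlyLLM]) -> bool:
    """Flag a sample if any text-only model already answers it correctly."""
    return any(llm(sample.question, sample.options) == sample.answer for llm in llms)

def prefilter(candidates: List[Sample], llms: List[TextOnlyLLM]) -> List[Sample]:
    """Stage 1: drop samples solvable without the image. Stage 2 (human review)
    would then check visual dependency, leakage, and difficulty on the survivors."""
    return [s for s in candidates if not text_only_solvable(s, llms)]
```

The human-review stage, which the paper treats as the final gate, is deliberately left out of the sketch.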
If this is right
- Scores on existing benchmarks such as MMMU systematically overestimate true multi-modal capability.
- Models may improve on leaderboards by exploiting text shortcuts rather than learning to integrate vision and language.
- Reported gains from multi-modal training are unreliable until leakage is measured and subtracted.
- The two proposed metrics allow future work to distinguish memorization effects from genuine cross-modal learning.
- Benchmark design must now prioritize explicit checks for visual necessity to avoid guiding research toward text-only solutions.
Where Pith is reading between the lines
- Benchmark creators in other modalities could adopt the same automated-plus-human pipeline to reduce text-only solvability.
- Widespread adoption of MMStar-style sets would likely produce a temporary slowdown in reported progress until models improve their actual vision components.
- The leakage findings imply that large-scale training corpora need systematic deduplication against future test sets.
- Individual model developers could use the leakage metric to audit how much of their performance comes from memorization of public benchmarks.
Load-bearing premise
Human reviewers can reliably identify samples that genuinely require images and are free of leakage, without their own knowledge gaps or inconsistent judgments biasing the selection.
What would settle it
If top LVLMs achieve nearly the same accuracy on MMStar with images removed as they do with images present, the claim that the new benchmark successfully isolates vision-dependent tasks would be falsified.
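A minimal sketch of that check, assuming a per-sample evaluate(sample, use_image) callable and a 2-point tolerance; both are illustrative choices, not a protocol from the paper.

```python
from typing import Callable, List

def image_ablation_gap(samples: List, evaluate: Callable[[object, bool], bool]) -> float:
    """Accuracy with images minus accuracy without images, on the same samples."""
    acc_with = sum(evaluate(s, True) for s in samples) / len(samples)
    acc_without = sum(evaluate(s, False) for s in samples) / len(samples)
    return acc_with - acc_without

# If the gap is near zero for top LVLMs, MMStar would have failed to isolate
# vision-dependent tasks (the falsification condition stated above).
# gap = image_ablation_gap(mmstar_samples, my_model_evaluate)
# vision_indispensable = gap > 0.02   # illustrative threshold
```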
Original abstract
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random choice baseline across six benchmarks over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLM. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline, human review is then involved to ensure each curated sample exhibits visual dependency, minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies two issues in existing LVLM benchmarks: (1) many samples do not require visual input, as answers can be inferred from questions, options, or LLM world knowledge (e.g., GeminiPro achieves 42.9% on MMMU without images), and (2) unintentional data leakage from training data allows models to answer visual-necessary questions without images (e.g., Sphinx-X-MoE at 43.6% on MMMU without images, exceeding its LLM backbone). To mitigate misjudgment of multi-modal gains, the authors introduce MMStar, a 1,500-sample benchmark covering 6 core capabilities and 18 axes, curated via an automated pipeline from existing benchmarks followed by human review to enforce visual dependency, minimal leakage, and advanced multi-modal requirements. They also propose two metrics to quantify leakage and actual multi-modal performance gains, and evaluate 16 leading LVLMs on MMStar and 7 other benchmarks.
Significance. If the curation successfully isolates vision-indispensable samples with negligible leakage, MMStar would offer a more accurate benchmark for true multi-modal capabilities than current ones, helping to better measure progress and avoid misguided research directions. The leakage and gain metrics provide a concrete, behavior-based way to diagnose benchmark contamination. The empirical evaluation across models supplies useful comparative data on current LVLM limitations.
Major comments (2)
- [MMStar construction] In the MMStar construction section: the human review step lacks any reported inter-annotator agreement, explicit decision criteria for 'visual dependency' and 'minimal data leakage', or a described protocol for leakage detection (e.g., systematic zero-image testing on candidates). Because the central claim that MMStar corrects benchmark mismeasurement rests entirely on the purity of these 1,500 samples, the absence of reproducibility details for the filter is a load-bearing gap.
- [Metrics and evaluation] In the metrics and evaluation sections: the exact operational definitions of the two proposed metrics (leakage measured on held-out visual-absent inputs, and multi-modal gain) are not fully formalized, including how candidates are held out and how gains are normalized against LLM backbones. This makes it difficult to verify that the metrics avoid circularity with the same models used in filtering.
Minor comments (2)
- Tables reporting model scores on MMStar and other benchmarks would benefit from explicit mention of whether results are averaged over multiple runs or seeds, and inclusion of standard deviations.
- The automated pipeline for initial sample selection is referenced but its precise filtering rules (e.g., thresholds for text-only solvability) are not enumerated, which would aid reproducibility even if human review is the final gate.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback, which highlights important aspects of reproducibility and metric formalization. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [MMStar construction] In the MMStar construction section: the human review step lacks any reported inter-annotator agreement, explicit decision criteria for 'visual dependency' and 'minimal data leakage', or a described protocol for leakage detection (e.g., systematic zero-image testing on candidates). Because the central claim that MMStar corrects benchmark mismeasurement rests entirely on the purity of these 1,500 samples, the absence of reproducibility details for the filter is a load-bearing gap.
Authors: We agree that the human review process requires more explicit documentation to support reproducibility claims. In the revised manuscript, we will add a dedicated subsection detailing: (i) the annotation guidelines and explicit decision criteria for visual dependency (samples where correct answers require image content, verified by human judgment that text-only versions yield near-random performance); (ii) criteria for minimal data leakage (samples where candidate LVLMs achieve performance statistically indistinguishable from random guessing without images); (iii) the leakage detection protocol, which includes systematic zero-image testing on all candidates using multiple models; and (iv) inter-annotator agreement results (e.g., Cohen's kappa and raw agreement rates) computed over a sampled subset reviewed by three independent annotators. These additions will directly address the concern that the benchmark's claims rest on the purity of the curated samples.
Revision: yes
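As an aside on the agreement figures promised in this response, a minimal sketch of Cohen's kappa over two annotators' keep/drop decisions (the three-annotator case would need pairwise averaging or Fleiss' kappa); the labels and numbers below are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same samples:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e the agreement expected by chance from each annotator's label rates."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(labels_a) | set(labels_b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Illustrative keep/drop decisions from two reviewers on five candidate samples:
# cohens_kappa(["keep", "drop", "keep", "drop", "keep"],
#              ["keep", "keep", "keep", "drop", "keep"])   # ~= 0.545
```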
Referee: [Metrics and evaluation] In the metrics and evaluation sections: the exact operational definitions of the two proposed metrics (leakage measured on held-out visual-absent inputs, and multi-modal gain) are not fully formalized, including how candidates are held out and how gains are normalized against LLM backbones. This makes it difficult to verify that the metrics avoid circularity with the same models used in filtering.
Authors: We acknowledge the need for precise formalization to eliminate any ambiguity around circularity. In the revision, we will introduce mathematical definitions and pseudocode: the leakage metric is defined as Acc_LVLM(text-only) - Acc_random on the final MMStar set; the multi-modal gain metric is [Acc_LVLM(with-image) - Acc_LVLM(text-only)] normalized by subtracting the corresponding LLM backbone's text-only accuracy. We will explicitly state that the initial automated filtering used a disjoint preliminary model set, while the reported metrics are computed on the held-out evaluation of 16 LVLMs after curation is complete, ensuring no overlap in model outputs between filtering and metric calculation. This separation will be documented with a clear pipeline diagram.
Revision: yes
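Taking these definitions exactly as worded in the response above (the paper's final formalization may differ), a minimal sketch with accuracies as plain fractions in [0, 1]:

```python
def leakage(acc_lvlm_text_only: float, acc_random: float) -> float:
    """Data-leakage proxy: how far the LVLM's text-only accuracy exceeds chance.
    Follows the rebuttal's wording; the paper's exact normalization may differ."""
    return acc_lvlm_text_only - acc_random

def multimodal_gain(acc_lvlm_with_image: float,
                    acc_lvlm_text_only: float,
                    acc_backbone_text_only: float) -> float:
    """Gain from actually using images, adjusted by the LLM backbone's
    text-only accuracy, as described in the response above."""
    return (acc_lvlm_with_image - acc_lvlm_text_only) - acc_backbone_text_only

# Made-up numbers for a 4-option benchmark (chance = 0.25):
# leakage(0.436, 0.25)                     -> 0.186
# multimodal_gain(0.55, 0.436, 0.30)       -> -0.186
```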
Circularity Check
No significant circularity; empirical measurements and benchmark construction are self-contained
Full rationale
The paper advances no mathematical derivation chain or first-principles predictions. Its central claims rest on direct empirical observations (model accuracy without images on existing benchmarks) and the construction of MMStar through an automated pre-filter plus human review to enforce visual dependency and low leakage. The two new metrics are defined from observable model behavior on held-out visual-absent inputs rather than fitted to the target result. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the core argument; the human-review filter is an external selection step, not a definitional reduction. The argument therefore does not depend circularly on its own conclusions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human reviewers can reliably detect visual dependency and data leakage in benchmark samples.
Lean theorems connected to this paper
- Foundation.LawOfExistence: defect_zero_iff_one (unclear). Linked claim: "Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs."
Forward citations
Cited by 32 Pith papers
- OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems. OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning. GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts. COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.
- COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts. COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
- Improving Vision-language Models with Perception-centric Process Reward Models. Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation. Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
- Reinforcing Multimodal Reasoning Against Visual Degradation. ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
- Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All? Stronger VLM agents use mirror reflections for self-identification in controlled 3D tests, while weaker ones inspect but fail to extract or correctly attribute self-relevant information.
- MMTB: Evaluating Terminal Agents on Multimedia-File Tasks. MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
- Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs. Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
- Segment-Aligned Policy Optimization for Multi-Modal Reasoning. SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference. MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference. MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
- RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction. RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
- CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models. CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
- Qwen3-Omni Technical Report. Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- SmolVLM: Redefining small and efficient multimodal models. SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering. CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.
- Qwen3.5-Omni Technical Report. Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models. Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- Qwen2.5-Omni Technical Report. Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding. DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
- LLaVA-OneVision: Easy Visual Task Transfer. LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation. JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
- Show-o2: Improved Native Unified Multimodal Models. Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- Seed1.5-VL Technical Report. Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.