pith. machine review for the scientific record.

arxiv: 2403.05525 · v2 · submitted 2024-03-08 · 💻 cs.AI

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Bingxuan Wang, Bo Liu, Bo Zhang, Chengqi Deng, Chong Ruan, Hanwei Xu, Hao Yang, Haoyu Lu, Jingxiang Sun, Kai Dong, Tongzheng Ren, Wen Liu, Yaofeng Sun, Zhenda Xie, Zhuoshu Li

Pith reviewed 2026-05-11 17:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords vision-language models · multimodal understanding · real-world applications · instruction tuning · hybrid vision encoder · open-source VL models · chatbot performance

The pith

DeepSeek-VL models achieve competitive or state-of-the-art results on vision-language benchmarks while delivering strong practical user experiences as chatbots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepSeek-VL as an open-source family of vision-language models built specifically for real-world use. It covers three main choices: collecting diverse data that includes web screenshots, PDFs, OCR, charts, and knowledge content; deriving an instruction-tuning set from a taxonomy of actual user scenarios; and applying a hybrid vision encoder plus an early-integrated pretraining method to keep language skills intact while handling high-resolution images efficiently. The central claim is that these steps together produce models that feel better to users in everyday applications and still score well on standard benchmarks. If the approach holds, smaller open models could become reliable for document, chart, and screenshot tasks without needing large compute or sacrificing text-only performance.
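To make the early-integration idea concrete, the sketch below shows one way a modality-mixing batch sampler can keep language-only training flowing throughout vision-language pretraining. The 0.7 ratio and the function names are illustrative assumptions for this sketch, not values reported by the paper.

```python
import random

def mixed_batches(text_batches, vl_batches, language_ratio=0.7, seed=0):
    """Interleave language-only and vision-language batches.

    language_ratio is an assumed, illustrative probability of drawing a
    text-only batch at each step; the idea is that LLM training continues
    from the very start of VL pretraining instead of being bolted on later.
    """
    rng = random.Random(seed)
    text_it, vl_it = iter(text_batches), iter(vl_batches)
    while True:
        try:
            if rng.random() < language_ratio:
                yield "text", next(text_it)  # pure language batch
            else:
                yield "vl", next(vl_it)      # image-text batch
        except StopIteration:
            return  # stop when either stream runs dry
```

In practice the ratio would likely be scheduled rather than fixed; the paper's emphasis on managing the competitive dynamics between modalities suggests tuning this balance is where much of the work lies.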

Core claim

The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.

What carries the argument

A hybrid vision encoder that handles 1024x1024 images at low compute cost, paired with a VL pretraining strategy that integrates LLM training from the start to balance vision and language modalities.
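For a concrete picture, here is a minimal sketch of how such a hybrid encoder can be wired: a low-resolution branch for global semantics and a high-resolution branch for fine detail, pooled to a fixed visual-token budget and projected into the LLM's embedding space. The backbone choices, dimensions, and the 576-token budget are placeholder assumptions for this sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Illustrative hybrid encoder; branch modules and dimensions are
    assumptions for the sketch, not the paper's exact design."""

    def __init__(self, semantic_branch, detail_branch,
                 sem_dim=1024, det_dim=256, llm_dim=4096, num_tokens=576):
        super().__init__()
        self.semantic_branch = semantic_branch   # e.g. a ViT run at low resolution
        self.detail_branch = detail_branch       # e.g. an encoder run at 1024x1024
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)        # fixed token budget
        self.proj = nn.Linear(sem_dim + det_dim, llm_dim)   # into LLM embedding space

    def forward(self, image_lowres, image_highres):
        sem = self.semantic_branch(image_lowres)   # (B, N1, sem_dim)
        det = self.detail_branch(image_highres)    # (B, N2, det_dim)
        # Pool both branches to the same token count, then fuse per token.
        sem = self.pool(sem.transpose(1, 2)).transpose(1, 2)
        det = self.pool(det.transpose(1, 2)).transpose(1, 2)
        fused = torch.cat([sem, det], dim=-1)      # (B, num_tokens, sem_dim+det_dim)
        return self.proj(fused)                    # (B, num_tokens, llm_dim)
```

The design point this illustrates is that the LLM's visual sequence length stays constant no matter how large the high-resolution input is; only the detail branch pays for the extra pixels.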

If this is right

  • Smaller open-source models become practical choices for real-world visual tasks such as document and chart understanding.
  • Instruction tuning drawn from actual use-case taxonomies measurably improves everyday user experience.
  • Early integration of language-model training during pretraining preserves performance on text-only benchmarks.
  • High-resolution image processing becomes feasible in VL models without large increases in compute overhead (see the token arithmetic sketched after this list).
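The compute point in the last bullet reduces to token counting. A back-of-the-envelope check, with an assumed ViT patch size of 16 (illustrative numbers, not the paper's):

```python
# Illustrative token arithmetic; patch size and resolutions are assumptions.
patch = 16
naive_tokens = (1024 // patch) ** 2   # 4096 tokens if 1024x1024 is patched directly
lowres_tokens = (384 // patch) ** 2   # 576 tokens for the same ViT at 384x384
print(naive_tokens, lowres_tokens, naive_tokens / lowres_tokens)
# -> 4096 576 7.11...: naive high-res patching inflates the LLM's visual
# sequence ~7x, and self-attention cost grows roughly quadratically in that
# length, which is why pooling the high-res branch down to a small fixed
# token budget keeps the overhead low.
```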

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same data-and-training balance could be tested on other multimodal tasks where one modality tends to dominate training.
  • Public release of these base models may speed up development of specialized tools for web and document analysis.
  • Emphasis on use-case-derived data might reduce the gap between lab benchmark scores and real deployment performance.

Load-bearing premise

The combination of diverse real-world data, the hybrid encoder, and the balanced pretraining strategy will produce clear gains in user experience and benchmark scores without hidden losses in capability or efficiency.

What would settle it

Direct side-by-side tests on the same visual-language benchmarks, or user studies, that show the models falling behind other models of equal size in either accuracy or perceived chatbot quality.

Original abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DeepSeek-VL, an open-source family of vision-language models (1.3B and 7B) for real-world applications. It details a data pipeline emphasizing diversity and coverage of practical scenarios (web screenshots, PDFs, OCR, charts, knowledge content), construction of an instruction-tuning dataset via a use-case taxonomy derived from real user scenarios, a hybrid vision encoder supporting 1024x1024 images at modest compute cost, and an early-integration VL pretraining strategy that interleaves LLM training to preserve language capabilities while addressing modality competition. The central claims are that the resulting models provide superior chatbot user experience in practical settings, achieve SOTA or competitive scores on a range of VL benchmarks at comparable sizes, and retain robust performance on language-centric benchmarks, with both model sizes released publicly.

Significance. If the empirical results hold, the work offers a publicly available VL model explicitly tuned for real-world utility and efficiency, with a design that prioritizes retention of base LLM strengths. The hybrid encoder and taxonomy-driven instruction data represent concrete engineering choices that could inform subsequent multimodal systems. Public model release enables direct verification and extension.

major comments (2)
  1. [§3] Abstract and §3 (VL pretraining strategy): The claim that the early-integration pretraining 'ensures the preservation of LLM capabilities' and yields 'robust performance on language-centric benchmarks' is load-bearing for the no-trade-off assertion. The manuscript reports only final VL-model scores; without side-by-side tables comparing the 1.3B/7B DeepSeek-VL variants to the unmodified DeepSeek-LLM baselines on identical language tasks (e.g., MMLU, GSM8K), the effectiveness of the strategy in managing vision-language competition cannot be verified.
  2. [§4] §4 (Experiments and benchmarks): The abstract asserts 'state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size.' The reported numbers must be accompanied by explicit model-size-matched baselines (e.g., LLaVA-1.5-7B, Qwen-VL-7B) and ablation results isolating the contribution of the hybrid encoder and the use-case instruction dataset; otherwise the superiority claim rests on incomplete controls.
minor comments (2)
  1. [§2.2] The hybrid vision encoder architecture (described in §2.2) would benefit from a diagram showing the integration points with the LLM and the exact tokenization of high-resolution patches.
  2. [§2.1] Dataset statistics (total tokens, image-text pair counts, taxonomy coverage percentages) are referenced but not tabulated; adding a summary table would strengthen the 'diverse, scalable' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [§3] Abstract and §3 (VL pretraining strategy): The claim that the early-integration pretraining 'ensures the preservation of LLM capabilities' and yields 'robust performance on language-centric benchmarks' is load-bearing for the no-trade-off assertion. The manuscript reports only final VL-model scores; without side-by-side tables comparing the 1.3B/7B DeepSeek-VL variants to the unmodified DeepSeek-LLM baselines on identical language tasks (e.g., MMLU, GSM8K), the effectiveness of the strategy in managing vision-language competition cannot be verified.

    Authors: We agree that direct side-by-side comparisons to the base DeepSeek-LLM models on language-only benchmarks would provide clearer evidence for the effectiveness of our early-integration pretraining strategy. We will add a dedicated table in the revised manuscript reporting results on MMLU, GSM8K, and similar tasks for both the 1.3B and 7B DeepSeek-VL models alongside the unmodified DeepSeek-LLM baselines. revision: yes

  2. Referee: [§4] §4 (Experiments and benchmarks): The abstract asserts 'state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size.' The reported numbers must be accompanied by explicit model-size-matched baselines (e.g., LLaVA-1.5-7B, Qwen-VL-7B) and ablation results isolating the contribution of the hybrid encoder and the use-case instruction dataset; otherwise the superiority claim rests on incomplete controls.

    Authors: The experiments section already includes comparisons against size-matched models such as LLaVA-1.5-7B and Qwen-VL-7B on the reported VL benchmarks. To address the request for stronger controls, we will explicitly annotate all tables with model sizes and add ablation studies that isolate the impact of the hybrid vision encoder and the use-case taxonomy instruction dataset. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical model development with external benchmark validation

Full rationale

The paper presents an empirical vision-language model (DeepSeek-VL) whose central claims rest on reported benchmark scores and user-experience improvements from instruction tuning on a constructed dataset. No mathematical derivations, equations, or predictions are present that could reduce to fitted parameters or self-definitions by construction. The described pretraining strategy, hybrid encoder, and data curation are design choices justified by practical considerations and external evaluations rather than internal circular logic. Self-citations, if any, are not load-bearing for the performance claims, which are falsifiable against public benchmarks. This is a standard model-release paper whose validity hinges on reproducibility of the reported numbers, not on any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate specific free parameters, axioms, or invented entities; the central claims depend on empirical outcomes from data selection and training procedures whose details are not provided.

pith-pipeline@v0.9.0 · 5635 in / 1211 out tokens · 97623 ms · 2026-05-11T17:51:50.939356+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 8.0

    SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

  2. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

  3. The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Recorruption arises from visual attention suppression and positional bias in multimodal RAG; BAIR mitigates it via bottleneck attention intervention at inference time.

  4. Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 conditional novelty 7.0

    Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...

  5. Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

    cs.LG 2026-04 unverdicted novelty 7.0

    PND reduces object hallucination in VLMs via a dual-path contrast during decoding that amplifies visual features and penalizes linguistic priors, achieving reported SOTA results on POPE, MME, and CHAIR without retraining.

  6. PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation

    cs.CV 2026-04 unverdicted novelty 7.0

    PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.

  7. ROSE: Retrieval-Oriented Segmentation Enhancement

    cs.CV 2026-04 unverdicted novelty 7.0

    ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.

  8. DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...

  9. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  10. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

  11. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  12. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  13. Online Self-Calibration Against Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...

  14. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  15. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  16. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  17. If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    cs.CV 2026-04 unverdicted novelty 6.0

    LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

  18. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  19. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  20. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  21. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  22. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  23. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  24. Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    cs.CV 2026-05 unverdicted novelty 5.0

    ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.

  25. DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training

    cs.LG 2026-05 unverdicted novelty 5.0

    DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts w...

  26. Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.

  27. Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

    cs.CV 2026-05 unverdicted novelty 5.0

    Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

  28. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  29. AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

    cs.CL 2026-04 unverdicted novelty 5.0

    AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

  30. UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.

  31. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  32. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  33. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 33 Pith papers · 15 internal anchors
