pith. machine review for the scientific record.

arxiv: 2412.10302 · v1 · submitted 2024-12-13 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 theorem links

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Aixin Liu, Bingxuan Wang, Chengyue Wu, Chong Ruan, Damai Dai, Haowei Zhang, Huazuo Gao, Jiawei Wang, Kai Dong, Kai Hu, Kang Guan, Liang Zhao, Wen Liu, Xiaokang Chen, Xingchao Liu, Xingkai Yu, Xin Xie, Yaofeng Sun, Yishi Piao, Yisong Wang, Yiyang Ma, Yukun Li, Yu Wu, Yuxiang You, Zhenda Xie, Zhiyu Wu, Zizheng Pan

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 10:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords mixture of experts · vision-language models · dynamic tiling · multi-head latent attention · multimodal tasks · efficient inference · high-resolution images

The pith

DeepSeek-VL2 matches or exceeds prior vision-language models on multimodal tasks while using fewer activated parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepSeek-VL2 as an upgrade to earlier vision-language models through two main changes: a dynamic tiling method that encodes high-resolution images of varying shapes and sizes, and Multi-head Latent Attention that shrinks the key-value cache in the language component. These changes sit inside a Mixture-of-Experts framework and are paired with an improved training dataset. The resulting models, with 1.0B to 4.5B activated parameters, reach competitive or leading results on visual question answering, text recognition in images, document understanding, and visual grounding. If the gains hold, the work shows that targeted architectural choices can deliver strong multimodal performance without scaling up total compute.

Core claim

DeepSeek-VL2 incorporates dynamic tiling for vision encoding to handle high-resolution images with different aspect ratios and uses DeepSeekMoE models with Multi-head Latent Attention to compress key-value caches, enabling efficient inference. Trained on an improved vision-language dataset, the three variants achieve competitive or state-of-the-art performance across multimodal tasks with similar or fewer activated parameters than existing open-source dense and MoE models.

What carries the argument

Dynamic tiling vision encoding paired with Multi-head Latent Attention inside a Mixture-of-Experts language model, which processes variable-aspect-ratio images efficiently and reduces inference memory and latency.
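
To make the tiling step concrete, here is a minimal sketch of one way such a scheme can choose a grid for an arbitrary image. The 384×384 tile size and the cap of nine local tiles are illustrative assumptions, and the selection rule below (closest aspect ratio, ties broken toward more tiles) is a common choice in this line of work, not necessarily the authors' exact algorithm.

```python
from math import log, inf

TILE = 384       # per-tile encoder resolution; illustrative assumption
MAX_TILES = 9    # cap on local tiles per image; illustrative assumption

def pick_grid(width: int, height: int) -> tuple[int, int]:
    """Pick an (m, n) tile grid whose aspect ratio best matches the image,
    breaking ties toward more tiles (higher effective resolution)."""
    img_aspect = width / height
    best, best_err, best_area = (1, 1), inf, 0
    for m in range(1, MAX_TILES + 1):        # tile columns
        for n in range(1, MAX_TILES + 1):    # tile rows
            if m * n > MAX_TILES:
                continue
            err = abs(log((m / n) / img_aspect))  # aspect-ratio mismatch
            if err < best_err - 1e-9 or (abs(err - best_err) < 1e-9
                                         and m * n > best_area):
                best, best_err, best_area = (m, n), err, m * n
    return best

# A 16:9 image gets a wide grid; the encoder then sees m*n local
# TILE x TILE crops plus, typically, one global thumbnail tile.
print(pick_grid(1600, 900))  # -> (4, 2)
```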

Load-bearing premise

The performance improvements come primarily from the dynamic tiling strategy and Multi-head Latent Attention rather than from the improved training dataset or other tuning details.

What would settle it

Train an otherwise identical model without dynamic tiling or without Multi-head Latent Attention and check whether its scores on the reported benchmarks fall below the competitive range achieved by the full DeepSeek-VL2 variants.

read the original abstract

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
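
For readers who want the KV-compression claim in concrete terms, the following is a minimal PyTorch sketch of the latent-cache idea: hidden states are down-projected to a small shared latent, which is the only thing cached, and per-head keys and values are reconstructed from it on the fly. Dimensions are illustrative; causal masking and the decoupled rotary-embedding path of the real MLA design are omitted.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Minimal sketch of latent KV-cache compression in the spirit of MLA.
    Causal masking and MLA's decoupled rotary-embedding keys are omitted."""

    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.w_up_k = nn.Linear(d_latent, d_model, bias=False)     # rebuild K
        self.w_up_v = nn.Linear(d_latent, d_model, bias=False)     # rebuild V
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c_kv = self.w_down_kv(x)                       # (b, t, d_latent)
        if latent_cache is not None:
            # only the small latent is cached, not full per-head K/V:
            # d_latent floats per token instead of 2 * d_model
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        s = c_kv.shape[1]
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_up_k(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(c_kv).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                     # c_kv is the new cache
```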

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeepSeek-VL2, a family of Mixture-of-Experts vision-language models (DeepSeek-VL2-Tiny, Small, and the base model with 1.0B/2.8B/4.5B activated parameters). It describes two primary architectural upgrades over DeepSeek-VL: a dynamic tiling strategy for vision encoding that accommodates high-resolution images with varying aspect ratios, and the use of DeepSeekMoE equipped with Multi-head Latent Attention to compress KV caches for efficient inference. The models are trained on an improved vision-language dataset and are claimed to deliver competitive or state-of-the-art results on visual question answering, OCR, document/table/chart understanding, and visual grounding tasks while using similar or fewer activated parameters than prior open-source dense and MoE models.

Significance. If the performance numbers hold under scrutiny, the work would illustrate how targeted changes in vision tokenization and attention mechanisms can support strong multimodal capabilities at modest activated-parameter budgets, which is relevant for practical deployment. The public release of code and checkpoints is a clear positive for reproducibility.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The text states that the model series 'significantly improves upon its predecessor... through two key major upgrades' and achieves its results 'thanks to' dynamic tiling and Multi-head Latent Attention. However, the experimental section provides no ablation that fixes the training dataset, data mixture, and optimization schedule while removing or replacing only the dynamic tiling (reverting to fixed-resolution encoding) or only the Multi-head Latent Attention (reverting to standard attention within the MoE layers). Without such controls, the causal contribution of the two architectural changes to the reported efficiency-performance trade-off cannot be isolated from possible gains due to the 'improved vision-language dataset' or unstated hyperparameter differences.
  2. [Experimental results] Experimental results (tables comparing against other models): The benchmark tables report point estimates for the three variants but do not include standard deviations across multiple runs, confidence intervals, or statistical tests. This makes it difficult to determine whether the claimed 'competitive or state-of-the-art' margins are robust, especially for the smaller 1.0B and 2.8B variants where variance is typically higher.
minor comments (2)
  1. [Model architecture description] The manuscript would benefit from an explicit table or paragraph comparing total (non-activated) parameter counts alongside the activated counts for both DeepSeek-VL2 variants and the baseline models; this would clarify the sparsity level achieved by the MoE design.
  2. [Figures] Figure captions for the dynamic tiling illustration and the attention mechanism diagram could be expanded to include the exact mathematical formulation or pseudocode for the tiling selection and latent vector compression steps.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address the two major comments point by point below, indicating the revisions we intend to make to strengthen the presentation of our contributions.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The text states that the model series 'significantly improves upon its predecessor... through two key major upgrades' and achieves its results 'thanks to' dynamic tiling and Multi-head Latent Attention. However, the experimental section provides no ablation that fixes the training dataset, data mixture, and optimization schedule while removing or replacing only the dynamic tiling (reverting to fixed-resolution encoding) or only the Multi-head Latent Attention (reverting to standard attention within the MoE layers). Without such controls, the causal contribution of the two architectural changes to the reported efficiency-performance trade-off cannot be isolated from possible gains due to the 'improved vision-language dataset' or unstated hyperparameter differences.

    Authors: We appreciate the referee's point that the current experiments do not isolate the individual effects of dynamic tiling and Multi-head Latent Attention through controlled ablations with fixed data and training. The manuscript presents these two upgrades as the primary architectural changes enabling improved handling of high-resolution images and efficient inference, in combination with the enhanced vision-language dataset. While internal development confirmed their importance, we did not run the specific ablations described. We will revise the abstract and Section 1 to describe the performance as resulting from the combination of the architectural upgrades and the improved dataset, avoiding language that implies isolated causality. We will also add a short discussion paragraph on the design motivations for each upgrade, drawing on their individual properties and comparisons to prior approaches. revision: partial

  2. Referee: [Experimental results] Experimental results (tables comparing against other models): The benchmark tables report point estimates for the three variants but do not include standard deviations across multiple runs, confidence intervals, or statistical tests. This makes it difficult to determine whether the claimed 'competitive or state-of-the-art' margins are robust, especially for the smaller 1.0B and 2.8B variants where variance is typically higher.

    Authors: We agree that including variability measures would allow readers to better assess the robustness of the reported results. Training each model variant requires substantial compute, making multiple independent runs impractical in our setting. We will revise the experimental section to explicitly note that all results are from single training runs and add a limitation statement in the discussion or conclusion. We will also qualify the 'competitive or state-of-the-art' claims in the text where the margins are modest, consistent with reporting practices in other large-scale multimodal model papers. revision: partial
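
One low-cost middle ground for the variance question: even with a single training run per variant, resampling benchmark items yields a confidence interval on each reported score. This captures evaluation-set variance, though not run-to-run training variance. A minimal sketch follows; the 1,000-item set and 0.82 accuracy are hypothetical.

```python
import numpy as np

def bootstrap_ci(per_item_scores, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for a benchmark mean by resampling items, not training runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_item_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# e.g. 0/1 correctness on a hypothetical 1,000-item VQA set
demo = np.random.default_rng(1).binomial(1, 0.82, size=1000)
print(bootstrap_ci(demo))
```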

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

full rationale

The paper presents an empirical vision-language model with two described upgrades (dynamic tiling vision encoding and DeepSeekMoE with Multi-head Latent Attention) plus training on an improved dataset, followed by standard benchmark evaluations. No derivation chain, first-principles prediction, or fitted parameter is claimed; performance numbers are reported outcomes of training and testing rather than quantities defined in terms of themselves. Self-citations to prior DeepSeek MoE work exist but are not load-bearing for any tautological reduction, as the central claims rest on external benchmark scores rather than internal redefinitions or unverified self-references.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard supervised training of transformer-based MoE models plus two engineering choices (dynamic tiling and latent attention) whose benefits are measured empirically rather than derived.

free parameters (1)
  • activated parameter counts
    1.0B, 2.8B and 4.5B values chosen for the three model variants.
axioms (1)
  • domain assumption Mixture-of-Experts routing improves inference efficiency without harming quality when trained properly
    Invoked when claiming high throughput with fewer activated parameters.
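
To make the axiom and the 'activated parameter' bookkeeping concrete, a toy top-k router is sketched below: total parameters grow with the number of experts, while each token's compute involves only the k experts the gate selects — the gap the referee's first minor comment asks the authors to tabulate. All dimensions and the choice k=2 are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy top-k expert routing: parameter count scales with n_experts,
    but each token only activates k experts' worth of FFN compute."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        topw = topw / topw.sum(-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
total = sum(p.numel() for p in moe.parameters())
print(f"total params: {total:,}; per-token compute ~ gate + {moe.k} experts")
```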

pith-pipeline@v0.9.0 · 5633 in / 1154 out tokens · 49105 ms · 2026-05-11T10:04:21.128133+00:00 · methodology

discussion (0)


Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CHASM: Unveiling Covert Advertisements on Chinese Social Media

    cs.LG 2026-04 unverdicted novelty 8.0

    CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

  2. HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

    cs.CV 2026-04 accept novelty 8.0

    HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

  3. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

    cs.CL 2026-05 unverdicted novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

  4. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  5. SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

    cs.AI 2026-04 unverdicted novelty 7.0

    SpecVQA is a new benchmark dataset and evaluation suite for testing multimodal large language models on scientific spectral image understanding and visual question answering, supported by a curve-preserving sampling m...

  6. ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning...

  7. Can Multimodal Large Language Models Truly Understand Small Objects?

    cs.CV 2026-04 unverdicted novelty 7.0

    Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

  8. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  9. HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    HyperGVL is the first benchmark for LVLMs on hypergraph tasks from basic counting to NP-hard reasoning, with 12 models tested and a router proposed to adapt representations.

  10. RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.

  11. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  12. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  13. SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

  14. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  15. VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    cs.LG 2026-05 unverdicted novelty 6.0

    VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

  16. ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction

    cs.CV 2026-05 unverdicted novelty 6.0

    ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...

  17. Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

    cs.CV 2026-05 unverdicted novelty 6.0

    Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.

  18. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  19. DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

    cs.CV 2026-04 unverdicted novelty 6.0

    DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...

  20. MLLM-as-a-Judge Exhibits Model Preference Bias

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

  21. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  22. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  23. EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds"

    cs.CV 2026-04 unverdicted novelty 6.0

    EchoAgent is a new agentic AI system that integrates visual observation, quantitative measurement, and expert knowledge reasoning to achieve reliable echocardiography interpretation with up to 80% accuracy on CAMUS an...

  24. DeepSeek-OCR: Contexts Optical Compression

    cs.CV 2025-10 unverdicted novelty 6.0

    DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.

  25. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  26. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 5.0

    SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.

  27. UniMesh: Unifying 3D Mesh Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.

  28. Bias-constrained multimodal intelligence for equitable and reliable clinical AI

    cs.CV 2026-04 unverdicted novelty 5.0

    BiasCareVL is a bias-aware vision-language framework trained on 3.44 million medical samples that outperforms prior methods on clinical tasks like diagnosis and segmentation while aiming for equitable performance unde...

  29. AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis

    cs.MA 2026-04 unverdicted novelty 5.0

    AstroVLM deploys expert multi-agent collaboration with VLMs to outperform baselines on real-world astronomical imaging quality diagnosis.

  30. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  31. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  32. A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.

  33. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  34. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  35. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  36. Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

    cs.CV 2026-04 unverdicted novelty 4.0

    Geometric Reward Credit Assignment disentangles rewards to geometric tokens and adds reprojection consistency to boost 3D keypoint accuracy from 0.64 to 0.93 and bounding box IoU to 0.686 on a ShapeNetCore benchmark w...

  37. PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

    cs.AI 2026-04 unverdicted novelty 4.0

    PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...

  38. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  39. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  40. Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

    cs.LG 2026-03 unverdicted novelty 2.0

    The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · cited by 38 Pith papers · 18 internal anchors
