pith. machine review for the scientific record.

arxiv: 2410.07073 · v2 · submitted 2024-10-09 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

Pixtral 12B

Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang, Sophia Yang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:48 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal language model · vision encoder · native resolution · image understanding · document processing · open source model · benchmark evaluation

The pith

Pixtral-12B outperforms similar and larger open multimodal models by processing images at their native resolution and aspect ratio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pixtral-12B is a 12-billion-parameter model trained to handle both natural images and documents while remaining strong at pure text tasks. It relies on a vision encoder built from scratch that accepts images in their original size and shape rather than forcing fixed crops or downsampling. This design gives users control over how many tokens represent each image and supports any number of images inside a 128,000-token context. The model reports higher scores than other open models of roughly the same size and even beats much larger open models on standard multimodal benchmarks.
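
To make the native-resolution claim concrete, here is a minimal sketch of the token arithmetic involved. The 16-pixel patch size and the per-row break token are assumptions for illustration (they are not stated on this page), but the pattern is the point: token count tracks image area, so a dense document page costs far more tokens than a small icon, and many such pages still fit inside a 128K window.

    import math

    def image_token_count(width: int, height: int, patch: int = 16) -> int:
        """Rough token count for one image kept at its native resolution.
        Assumes a ViT-style encoder emitting one token per patch plus one
        break token per row of patches (the patch size is an assumption)."""
        cols = math.ceil(width / patch)
        rows = math.ceil(height / patch)
        return rows * cols + rows

    def images_that_fit(resolutions, context=128_000, text_budget=8_000):
        """Count how many of the given images fit next to a text budget."""
        remaining = context - text_budget
        fitted = 0
        for w, h in resolutions:
            cost = image_token_count(w, h)
            if cost > remaining:
                break
            remaining -= cost
            fitted += 1
        return fitted

    print(image_token_count(1024, 1024))        # ~4160 tokens for a dense page
    print(image_token_count(256, 256))          # ~272 tokens for a small image
    print(images_that_fit([(1024, 768)] * 64))  # ~38 such pages fit in 128K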

Core claim

Pixtral-12B is a 12-billion-parameter multimodal language model that understands natural images and documents. It achieves leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size and does not compromise on natural language performance. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens.

What carries the argument

A new vision encoder trained from scratch that ingests images at their natural resolution and aspect ratio, allowing flexible token counts per image.

If this is right

  • The model can accept variable numbers of tokens per image depending on content detail.
  • Any number of images can be included inside the 128K context window.
  • Text-only performance remains competitive with dedicated language models of similar size.
  • The contributed MM-MT-Bench and evaluation protocols provide a standardized way to measure practical vision-language capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Native-resolution encoding may reduce artifacts on fine-grained document tasks compared with fixed-resolution encoders.
  • Flexible token budgets per image could lower compute cost for simple scenes while preserving detail where needed (rough numbers in the sketch after this list).
  • Open release under Apache 2.0 may enable direct fine-tuning on domain-specific image-text pairs.
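
On the compute-cost point, a back-of-envelope sketch using the common approximation of roughly 2 FLOPs per parameter per token for a transformer forward pass; the token counts and the 500-token text prompt are illustrative assumptions, not measurements from the paper.

    DECODER_PARAMS = 12e9  # 12B decoder, per the model name

    def prefill_flops(image_tokens: int, text_tokens: int = 500) -> float:
        """Approximate decoder prefill cost: ~2 * params FLOPs per token."""
        return 2 * DECODER_PARAMS * (image_tokens + text_tokens)

    simple_scene = prefill_flops(image_tokens=272)   # small, low-detail image
    dense_doc = prefill_flops(image_tokens=4160)     # full-resolution document page
    print(f"{dense_doc / simple_scene:.1f}x more prefill compute")  # ~6.0x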

Load-bearing premise

The reported benchmark scores reflect fair, standardized evaluation without undisclosed differences in training data scale, filtering, or test-set contamination.

What would settle it

Re-running the exact same benchmark suite on Pixtral-12B and the compared models using identical prompts, evaluation code, and data splits would show whether the performance gaps persist.
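
As a sketch of what that settling experiment might look like: one frozen prompt template, one scoring rule, and one shared test split applied identically to every model. The generate callable and the exact-match scorer below are hypothetical placeholders standing in for each model's inference call and each benchmark's metric; they are not the paper's evaluation code.

    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class Example:
        image_path: str
        question: str
        answer: str

    # The prompt, decoding settings, and scorer are held fixed across models,
    # so any remaining score gap is attributable to the models rather than
    # to protocol differences.
    PROMPT = "Answer the question using the image.\nQuestion: {q}\nAnswer:"

    def evaluate(generate: Callable[[str, str], str],
                 examples: Iterable[Example],
                 score: Callable[[str, str], float]) -> float:
        scores = [score(generate(ex.image_path, PROMPT.format(q=ex.question)),
                        ex.answer)
                  for ex in examples]
        return sum(scores) / len(scores)

    def exact_match(prediction: str, reference: str) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())

    # results = {name: evaluate(fn, shared_split, exact_match)
    #            for name, fn in {"pixtral-12b": pixtral_generate,
    #                             "llama-3.2-90b": llama_generate}.items()}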

read the original abstract

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substanially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under Apache 2.0 license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Pixtral-12B, a 12-billion-parameter multimodal language model trained to understand natural images and documents. It features a new vision encoder that ingests images at native resolution and aspect ratio with flexible token counts, supports any number of images within a 128K-token context window, and reports leading results on multimodal benchmarks while preserving strong text-only performance. The work also releases the open MM-MT-Bench for practical vision-language evaluation and provides code for standardized multimodal LLM protocols. Pixtral-12B is claimed to substantially outperform open models of similar size (Llama-3.2 11B, Qwen-2-VL 7B) and even larger models such as Llama-3.2 90B while being 7x smaller.

Significance. If the benchmark comparisons prove reproducible under identical evaluation conditions, the result would be significant: it would demonstrate that architectural choices in the vision encoder and context handling can yield competitive or superior multimodal performance at modest scale, reducing reliance on massive parameter counts. The open release of both the model (Apache 2.0) and the MM-MT-Bench benchmark, together with standardized evaluation code, would further strengthen the contribution by enabling direct community verification and extension.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: the headline claim that Pixtral-12B outperforms Llama-3.2 90B while 7x smaller rests on direct benchmark comparability, yet the manuscript supplies no quantitative details on training-data volume, filtering, test-set overlap, image tokenization, resolution handling, or prompt templates used for all baselines. Without these, the reported gains cannot be confidently attributed to the new vision encoder rather than data or protocol differences.
  2. [Results] Results and experimental setup: no ablation studies, training-recipe details, or error bars are provided for the multimodal benchmark scores. This absence makes it impossible to isolate the contribution of the native-resolution vision encoder or to assess statistical robustness of the cross-model comparisons.
minor comments (1)
  1. [Abstract] Abstract: 'substanially' is a typographical error and should read 'substantially'.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the Pixtral-12B manuscript. We address each major comment below with clarifications and indicate planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation sections: the headline claim that Pixtral-12B outperforms Llama-3.2 90B while 7x smaller rests on direct benchmark comparability, yet the manuscript supplies no quantitative details on training-data volume, filtering, test-set overlap, image tokenization, resolution handling, or prompt templates used for all baselines. Without these, the reported gains cannot be confidently attributed to the new vision encoder rather than data or protocol differences.

    Authors: We agree that expanded details on evaluation protocols would improve transparency. In the revised manuscript we will add quantitative information on our own training data volume, filtering steps, image tokenization strategy, native-resolution handling, and prompt templates used. For the baseline models we followed the officially published benchmark numbers and evaluation protocols from their respective papers and leaderboards. Detailed training-data volumes and filtering for proprietary models such as Llama-3.2 are not publicly disclosed, so we will add an explicit limitations paragraph discussing this constraint and its implications for attribution. revision: partial

  2. Referee: [Results] Results and experimental setup: no ablation studies, training-recipe details, or error bars are provided for the multimodal benchmark scores. This absence makes it impossible to isolate the contribution of the native-resolution vision encoder or to assess statistical robustness of the cross-model comparisons.

    Authors: We will expand the experimental section and appendix with additional training-recipe details and will report error bars obtained from repeated evaluations on the main benchmark tables. Comprehensive ablations isolating every vision-encoder component were not performed due to compute limits, but we will include a more detailed discussion of the design choices and their expected impact on performance to help readers assess the contribution of native-resolution processing. revision: yes
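
For the promised error bars, one generic recipe is a nonparametric bootstrap over per-example scores; the sketch below is an assumption about how such intervals could be computed, not the authors' stated procedure.

    import random

    def bootstrap_ci(per_example_scores, n_resamples=10_000, alpha=0.05, seed=0):
        """Bootstrap confidence interval for a benchmark's mean score."""
        rng = random.Random(seed)
        n = len(per_example_scores)
        means = sorted(
            sum(rng.choices(per_example_scores, k=n)) / n
            for _ in range(n_resamples)
        )
        lo = means[int(alpha / 2 * n_resamples)]
        hi = means[int((1 - alpha / 2) * n_resamples) - 1]
        return lo, hi

    # Illustrative 0/1 correctness scores for a 500-item benchmark (68% accuracy):
    scores = [1.0] * 340 + [0.0] * 160
    print(bootstrap_ci(scores))  # roughly (0.64, 0.72)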

standing simulated objections not resolved
  • Quantitative details on training-data volume, filtering, and test-set overlap for all proprietary baseline models (e.g., Llama-3.2), which are not publicly available.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark reporting

full rationale

The paper presents Pixtral-12B as an empirical multimodal model release, with all claims consisting of benchmark scores on standard vision-language tasks. No equations, derivations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation load-bearing steps appear in the abstract or described content. The central performance assertions rest on external benchmark comparisons rather than any internal reduction to the model's own inputs or prior self-referential results, rendering the evaluation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on large-scale empirical training of a transformer-based multimodal architecture; no explicit free parameters, axioms, or invented entities are declared in the abstract.

pith-pipeline@v0.9.0 · 5713 in / 1009 out tokens · 45075 ms · 2026-05-14T23:48:54.453453+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lost in Translation: Do LVLM Judges Generalize Across Languages?

    cs.CL 2026-04 unverdicted novelty 8.0

    MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

  2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  3. VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.

  4. Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

    cs.AI 2026-04 unverdicted novelty 7.0

    Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...

  5. Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

  6. LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

    cs.CV 2026-03 unverdicted novelty 7.0

    KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

  7. Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.

  8. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  9. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  10. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  11. BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    VLMs exhibit a consistent 'Texture Bias Cliff' and fail to comprehend pure geometric shapes from boundary contours alone in zero-shot settings.

  12. MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

  13. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  14. Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-language models generate executable Behavior Tree policies for robots from synthetic vision-language data, with successful transfer demonstrated on two real manipulators.

  15. Perception Encoder: The best visual embeddings are not at the output of the network

    cs.CV 2025-04 unverdicted novelty 6.0

    Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...

  16. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  17. Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation

    cs.AI 2026-04 unverdicted novelty 5.0

    Y-axis features such as major tick digit length, number of ticks, value range, and format introduce significant biases in multimodal models during chart-to-table tasks, with y-axis prompting improving performance for ...

  18. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

    cs.AI 2026-04 unverdicted novelty 5.0

    Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.

  19. Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

    cs.CL 2026-04 unverdicted novelty 5.0

    Linear probing reveals a gap between internal representations and responses in LVLMs for visual document understanding, with task information encoded more linearly in intermediate layers than the final layer, and fine...

  20. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  21. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  22. Phoenix-VL 1.5 Medium Technical Report

    cs.CL 2026-05 unverdicted novelty 3.0

    Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...

  23. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 22 Pith papers · 10 internal anchors

  1. [1]

    The Claude 3 Model Family: Opus, Sonnet, Haiku

    Anthropic (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  2. [2]

    Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., and Taşırlar, S. (2023). Fuyu-8b: A multimodal architecture for AI agents

  3. [3]

    Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

    Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I. M., et al. (2024). Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. Advances in Neural Information Processing Systems, 36

  4. [4]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al. (2024). Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv preprint arXiv:2409.17146

  5. [5]

    Dosovitskiy, A. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  6. [6]

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  7. [7]

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

  8. [8]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

  9. [9]

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z., and Li, C. (2024). Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

  10. [10]

    Li, X., Wang, Z., and Xie, C. (2023). An inverse scaling law for clip training. In NeurIPS

  11. [11]

    Lepard: Learning Partial Point Cloud Matching in Rigid and Deformable Scenes

    Li, Y. and Harada, T. (2022). Lepard: Learning partial point cloud matching in rigid and deformable scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5554–5564

  12. [12]

    Liu, H., Li, C., Li, Y., and Lee, Y. J. (2024a). Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306

  13. [13]

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2024b). Visual instruction tuning. Advances in Neural Information Processing Systems, 36

  14. [14]

    Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. (2023). Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255

  15. [15]

    Mistral NeMo 12B

    MistralAI (2024). Mistral NeMo 12B. https://mistral.ai/news/mistral-nemo/

  16. [16]

    OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774

  17. [17]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR

  18. [18]

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

  19. [19]

    Shazeer, N. (2020). GLU variants improve transformer. arXiv preprint arXiv:2002.05202

  20. [20]

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063

  21. [21]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  22. [22]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30

  23. [23]

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. (2024). Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution

  24. [24]

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. (2023). Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint

  25. [25]

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623

  26. [26]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Zhong, W., Cui, R., Guo, Y ., Liang, Y ., Lu, S., Wang, Y ., Saied, A., Chen, W., and Duan, N. (2023). Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364. 17 Appendix Table of Contents A Prompts 19 A.1 MMMU and Mathvista . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.2 ChartQA . . . . . ....