20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
Pith reviewed 2026-05-14 21:24 UTC · model grok-4.3
The pith
Data curation alone raises VLM performance by more than 11 percentage points across 20 benchmarks while using far less training compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a data curation pipeline to the MAmmoTH-VL single-image subset while keeping architecture and training fixed, the resulting models achieve a +11.7pp average gain on 20 VLM benchmarks and +11.3pp on DatBench. They surpass InternVL3.5-2B by 9.9pp at roughly 17 times less training compute and close the gap to Qwen3-VL-2B to within 1.8pp at roughly 87 times less compute. Curation also reduces per-capability variance by about 67 percent, improves the OOD average by 7.2pp, yields more honest and concise responses on open-ended queries, and delivers higher accuracy at lower response FLOPs at the 1B, 2B, and 4B scales.
What carries the argument
The data curation pipeline that filters and selects high-quality single-image training examples from the MAmmoTH-VL dataset.
If this is right
- Per-capability standard deviation across training seeds drops by roughly 67 percent and the gains persist across a 4k-to-16k context-length sweep.
- The nine-eval out-of-distribution average rises by 7.2pp and multi-image BLINK improves by 3.09pp despite single-image-only training.
- Across roughly 1,100 open-ended queries the curated 2B model is more honest, specific, concise, and less refusal-prone than matched baselines.
- At every tested scale the curated model raises accuracy while lowering response FLOPs relative to the matched-compute baseline.
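The reliability claim in the first bullet can be made concrete. A minimal sketch of how the per-capability standard-deviation reduction could be computed from multi-seed runs; all capability names and scores used with it are illustrative, not values from the paper:

```python
# Hedged sketch: computing the relative drop in per-capability standard
# deviation across training seeds. Inputs are illustrative, not from the paper.
import statistics

def per_capability_std(scores_by_seed):
    """scores_by_seed: one dict of {capability: score} per training seed."""
    caps = scores_by_seed[0].keys()
    return {c: statistics.stdev(run[c] for run in scores_by_seed) for c in caps}

def mean_std_reduction(baseline_runs, curated_runs):
    """Average relative drop in per-capability std (0.67 would mean a 67% drop)."""
    b = per_capability_std(baseline_runs)
    c = per_capability_std(curated_runs)
    return statistics.fmean(1 - c[k] / b[k] for k in b)
```

For example, a capability whose score varies by ±4pp across seeds under the baseline but only ±1pp after curation contributes a 0.75 reduction.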
Where Pith is reading between the lines
- If the curation method generalizes across datasets and scales, research attention may shift from raw model size toward systematic data quality work.
- Similar pipelines could be applied to other VLM pretraining corpora to test whether comparable accuracy-compute trade-offs appear at larger parameter counts.
- The combination of higher accuracy and lower inference FLOPs suggests curation can simultaneously improve capability and deployment cost.
- The observed improvements in honesty and specificity on open-ended queries indicate curation can influence response style beyond benchmark accuracy.
Load-bearing premise
That the reported gains are produced solely by the data curation pipeline and not by any unstated differences in training dynamics or evaluation protocols.
What would settle it
Retrain the exact baseline model on the original uncurated MAmmoTH-VL data using identical random seeds, hyperparameters, and evaluation code to check whether the performance gap disappears.
Original abstract
Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a data curation pipeline applied to the MAmmoTH-VL single-image subset, while holding VLM architecture, training recipe, and compute fixed, delivers +11.7pp average gains across 20 public benchmarks (grounding, VQA, OCR, captioning, spatial, counting, charts, math, brand-ID, multi-image) and +11.3pp across all nine DatBench axes. It further reports reduced seed-to-seed variance, improved OOD generalization (including +3.09pp on multi-image BLINK despite single-image training), more honest/specific/concise open-ended behavior, and Pareto improvements in accuracy vs. inference FLOPs at 1B/2B/4B scales, enabling near-frontier performance at up to ~150x lower training compute.
Significance. If the matched-compute attribution holds, the result is significant: it positions data curation as a high-leverage, compute-efficient lever for VLMs that can close much of the gap to frontier models without architectural or scale changes. The multi-axis evaluation (reliability, OOD, behavioral, inference cost) and consistent gains across diverse benchmarks strengthen the case beyond single-metric accuracy. The empirical design with fixed controls is a methodological strength that, if fully verified, would make the findings actionable for the field.
major comments (3)
- [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.
- [§3.2, Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.
- [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.
minor comments (3)
- [Figures] Figure captions and Pareto plots should explicitly annotate the training compute (tokens or FLOPs) for each scale (1B/2B/4B) and reference model to facilitate direct comparison.
- [Abstract] The abstract introduces several acronyms (VLM, OOD, DatBench) without expansion on first use; a brief parenthetical definition would improve readability for a broad audience.
- [§3.2] Consider adding a short table summarizing the exact number of samples retained after each curation stage for transparency.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The three major comments identify important areas where additional detail and rigor will strengthen the manuscript. We address each point below and will incorporate the suggested revisions in the next version.
Point-by-point responses
Referee: [Abstract and Methods] The central claim attributes all gains to data curation alone under identical architecture, recipe, and compute. However, the manuscript provides no explicit confirmation (e.g., hyperparameter tables, seed values, batch-construction logic, data-loading order, or optimizer details) that effective training dynamics were unchanged between baseline and curated runs. This verification is load-bearing for crediting the +11.7pp lift solely to curation rather than subtle implementation differences.
Authors: We agree that explicit verification of matched training dynamics is essential. The manuscript states in Section 3.1 that the same architecture, optimizer, learning-rate schedule, batch size, and random seeds were used for both the baseline and curated runs, with compute matched by token count. However, we did not include a consolidated hyperparameter table or low-level details such as data-loading order. In the revision we will add an appendix table listing all hyperparameters, seeds, batch-construction logic, and optimizer settings, together with a short statement confirming that the only controlled difference between runs was the training data subset. revision: yes
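The promised control — identical everything except the data — can also be checked mechanically. A minimal sketch, assuming each run's configuration is available as a flat dictionary (all key names here are hypothetical):

```python
# Hedged sketch: verify two run configs differ only in their data fields.
# Key names ("lr", "train_data", etc.) are hypothetical placeholders.

def only_data_differs(cfg_a, cfg_b, data_keys=("train_data",)):
    """True iff the two run configs are identical except for the data fields."""
    keys = set(cfg_a) | set(cfg_b)
    mismatched = {k for k in keys if cfg_a.get(k) != cfg_b.get(k)}
    return mismatched <= set(data_keys)

# Illustrative configs: identical hyperparameters and seed, different data.
baseline_cfg = {"lr": 1e-3, "seed": 0, "batch_size": 256, "train_data": "mammoth_raw"}
curated_cfg  = {"lr": 1e-3, "seed": 0, "batch_size": 256, "train_data": "mammoth_curated"}
```

Running such a check over the full appendix hyperparameter table would turn the "only controlled difference" statement into an auditable artifact rather than a claim.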
Referee: [§3.2, Data Curation Pipeline] The curation criteria, quality metrics, filtering thresholds, and selection procedures are described at a high level without quantitative details, example filtered samples, or ablation on individual curation steps. This lack of specificity is load-bearing for reproducibility and for confirming that the reported gains generalize beyond the particular MAmmoTH-VL subset and are not artifacts of unstated implementation choices.
Authors: We acknowledge that Section 3.2 currently presents the pipeline at a high level. To improve reproducibility we will expand the section with the exact quality metrics (CLIP similarity, caption length, OCR density, etc.), numerical filtering thresholds, and the precise selection procedure. We will also include representative examples of filtered-out and retained samples and add an ablation table quantifying the contribution of each curation step to the final gains. These additions will make the pipeline fully specified and allow readers to assess generalization beyond the MAmmoTH-VL subset. revision: yes
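The filtering stage described above could take roughly this shape. A hedged sketch: the metric names mirror the signals the response names (CLIP similarity, caption length, OCR density), but every threshold below is an illustrative placeholder, not a value from the paper:

```python
# Hedged sketch of a per-sample quality gate. Metric names follow the signals
# named in the response; all thresholds are illustrative placeholders.

def keep_sample(clip_sim, caption_len, ocr_density,
                min_sim=0.28, min_len=8, max_len=512, max_ocr=0.9):
    """True if a single-image sample passes every quality gate."""
    if clip_sim < min_sim:                        # weak image-text alignment
        return False
    if not (min_len <= caption_len <= max_len):   # degenerate or bloated caption
        return False
    if ocr_density > max_ocr:                     # near-pure text screenshot
        return False
    return True

def curate(samples):
    """Keep only samples (dicts of precomputed metrics) that pass every gate."""
    return [s for s in samples
            if keep_sample(s["clip_sim"], s["caption_len"], s["ocr_density"])]
```

An ablation table of the kind the authors propose would then report the retained-sample count and the benchmark delta after disabling each gate in turn.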
Referee: [Results and Evaluation] While average improvements are highlighted, the manuscript lacks per-benchmark statistical significance tests, confidence intervals, or variance estimates across the 20 benchmarks and 9 DatBench axes. Given the breadth of capabilities tested, this weakens the ability to rule out that gains are driven by a subset of benchmarks or evaluation-protocol sensitivities.
Authors: The manuscript reports average gains and notes reduced seed-to-seed variance, but does not provide per-benchmark confidence intervals or formal significance tests. In the revised version we will add per-benchmark standard deviations where multiple seeds were run, include 95% confidence intervals for the main averages, and perform paired t-tests (or Wilcoxon tests where appropriate) on the 20 benchmarks and 9 DatBench axes to establish statistical significance of the reported lifts. revision: yes
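The proposed confidence intervals can be approximated without distributional assumptions. A minimal sketch of a paired bootstrap over per-benchmark lifts; any scores fed to it would be per-benchmark accuracies, and none of the paper's actual numbers appear here:

```python
# Hedged sketch: a paired bootstrap CI for the mean per-benchmark lift,
# avoiding the normality assumption of a t-test. Inputs are illustrative.
import random
import statistics

def paired_bootstrap_ci(baseline, curated, n_boot=10_000, alpha=0.05, seed=0):
    """Mean per-benchmark lift (curated - baseline) and its (1 - alpha) CI."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, curated)]
    boot_means = sorted(
        statistics.fmean(rng.choice(diffs) for _ in diffs)
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), (lo, hi)
```

A Wilcoxon signed-rank test on the same paired differences (e.g. via scipy.stats.wilcoxon) would complement the interval with the formal significance test the response proposes.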
Circularity Check
No circularity: purely empirical results from matched-compute runs
Full rationale
The paper reports benchmark gains from applying a data curation pipeline to MAmmoTH-VL while holding architecture, training recipe, and compute fixed. No equations, derivations, fitted parameters, or predictions appear in the abstract or described claims. Performance lifts (+11.7pp average) are measured directly on external benchmarks rather than derived from self-referential definitions or self-citations. The central attribution to curation alone is an empirical claim open to verification via replication, not a reduction by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the 20 public VLM benchmarks and DatBench accurately measure the intended capabilities without substantial bias or leakage.