pith. machine review for the scientific record.

arxiv: 2604.16979 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.CL

Recognition: unknown

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

Biao Wu, Ling Chen, Meng Fang, Yiwu Zhong

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL
keywords: data selection · multimodal LLMs · data filtering · vision-language models · image-text alignment · off-the-shelf models · VQA benchmarks

The pith

Off-the-shelf pretrained models can filter multimodal data so that training on the selection matches or beats the full dataset on VQA and math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether pretrained models that have never seen the target multimodal data can still pick good training examples for larger vision-language models. It proposes using these models to score text quality and image-text alignment, then sampling from the combined score distribution in a way that keeps variety in the selected set. If this works, data curation becomes cheaper because no new filtering model needs to be trained, and the resulting models perform at least as well as those trained on all available data. A reader might care because current multimodal datasets are large and messy, so smarter selection could yield better models with less compute.

Core claim

Without any task-specific training, off-the-shelf pretrained models can assess text quality and image-text alignment to build a joint quality-alignment distribution. Adaptive weighted sampling from this distribution then selects informative samples while preserving long-tail diversity. Models trained on the resulting DOSE-filtered data match or surpass those trained on the complete dataset across standard VQA and math benchmarks.

What carries the argument

The joint quality-alignment distribution built from off-the-shelf model assessments, paired with adaptive weighted sampling to select informative yet diverse samples.
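
The abstract gives neither the scoring functions nor the sampling formula, so the following is a minimal sketch under stated assumptions, not the paper's implementation: the two signals are a text-quality score and a CLIP image-text similarity, the joint weight is their product after min-max normalization, and selection uses weighted random sampling without replacement (WRS, the term Figure 2 uses) via the Efraimidis-Spirakis key trick. Any temperature or normalization constants the paper actually applies are not reproduced here.

```python
# Minimal sketch of DOSE-style selection (assumptions noted above; not the
# paper's code). Each sample gets a joint quality-alignment weight, and WRS
# without replacement keeps low-weight samples at a small but nonzero rate.
import numpy as np

def select_subset(text_quality, clip_score, k, rng=None):
    """Return indices of k samples drawn with probability ~ joint weight."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Min-max normalize each signal so neither dominates (an assumption).
    tq = (text_quality - text_quality.min()) / (np.ptp(text_quality) + 1e-8)
    cs = (clip_score - clip_score.min()) / (np.ptp(clip_score) + 1e-8)
    weight = tq * cs + 1e-8                     # joint quality-alignment weight
    # Efraimidis-Spirakis WRS: key_i = u_i ** (1 / w_i); keep the k largest keys.
    keys = rng.random(len(weight)) ** (1.0 / weight)
    return np.argsort(keys)[-k:]

# Toy usage with synthetic scores standing in for real scorer outputs.
rng = np.random.default_rng(7)
n = 665_000                                     # LLaVA Stage 2 scale, per Figure 2
text_quality = rng.beta(2, 5, size=n)
clip_score = rng.beta(5, 2, size=n)
subset = select_subset(text_quality, clip_score, k=n // 5, rng=rng)
print(subset.shape)                             # (133000,)
```

Sampling rather than hard top-k thresholding is what preserves some of the long-tail diversity the paper emphasizes: low-weight samples are down-weighted but not excluded outright.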

If this is right

  • Models trained on DOSE-selected data match or surpass full-dataset performance on VQA and math benchmarks.
  • The selection process maintains long-tail diversity in the data.
  • No task-specific training of filtering models is required, reducing computational cost.
  • The approach scales efficiently to large multimodal datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • General pretrained models appear to capture transferable signals for assessing data quality on new distributions.
  • The method could extend to data selection in text-only or other multimodal settings.
  • Using fewer but better samples may lower the environmental cost of training large models.
  • Combining assessments from multiple off-the-shelf models could refine the selection further.

Load-bearing premise

Off-the-shelf pretrained models never exposed to the target data can reliably judge text quality and image-text alignment.

What would settle it

Training on the DOSE-filtered data and finding that VQA and math benchmark scores fall well below those from the full dataset would disprove the value of this selection.

Figures

Figures reproduced from arXiv: 2604.16979 by Biao Wu, Ling Chen, Meng Fang, Yiwu Zhong.

Figure 1. Comparison of data selection methods. (A) Methods based on single-score proxies, often derived from either language or vision signals. (B) Methods that use VLMs as quality evaluators, which may suffer from data contamination or prior exposure when the evaluator has been trained on overlapping corpora. (C) Our method combines off-the-shelf pretrained models without requiring task-specific training on the ta…

Figure 2. Left: Combined distribution of Text-Quality Score and CLIP Score, with Text-Quality Score on the x-axis and CLIP Score on the y-axis. Color intensity indicates data density, where brighter colors correspond to higher densities. Right: Comparison between the original data distribution and the distribution after applying WRS on 665K samples from LLaVA Stage 2. The same axes as the left figure are used, with …

Figure 3. DOSE Data-Selection Efficiency and Wall-Clock Time Trade-Offs. (Left) Average relative performances …

Figure 4. Overall performance comparisons across different strategies and datasets. (a) and (b) show the results …

Figure 5. Performance comparison of different strategies based on CLIP-Score on TextVQA, GQA, MME, and …
original abstract

High-quality and diverse multimodal data are essential for improving vision-language models (VLMs), yet existing datasets often contain noisy, redundant, and poorly aligned samples. To address these problems, data filtering is commonly used to enhance the efficiency and performance of multimodal learning, but it introduces extra computational cost because filtering models are usually trained on the same data they are meant to screen. To reduce this cost, we study DOSE, which explores whether off-the-shelf pretrained models that have never seen the target data can be used to select training samples for larger and stronger multimodal models without any task-specific training. Even without fine-tuning, these models can effectively assess text quality and image-text alignment to guide data selection. Based on this, we build a joint quality-alignment distribution and apply adaptive weighted sampling to select informative samples while maintaining long-tail diversity. This approach enhances data diversity, enabling models trained on DOSE-filtered data to match or surpass those trained on the full dataset on standard VQA and math benchmarks. Extensive experiments demonstrate its effectiveness, efficiency, and scalability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DOSE, a data selection framework for training vision-language models (VLMs) that uses off-the-shelf pretrained models (never exposed to the target dataset) to compute zero-shot text-quality and image-text alignment scores. These scores form a joint distribution from which adaptive weighted sampling draws a subset intended to preserve diversity while removing noisy or misaligned samples. The central empirical claim is that VLMs trained on the DOSE-selected subset match or exceed the performance of models trained on the full original dataset on standard VQA and mathematical reasoning benchmarks.

Significance. If the results hold under rigorous controls, the approach would eliminate the need to train task-specific filtering models on the target data, lowering the computational barrier to high-quality multimodal data curation and offering a scalable way to handle redundancy and long-tail distributions in existing datasets.

major comments (2)
  1. [Experiments / Ablation studies] The headline claim, that off-the-shelf zero-shot scores can be used for adaptive sampling to produce a subset whose training utility equals or exceeds the full set, rests on the untested premise that these scores are sufficiently calibrated to downstream VQA/math utility. The manuscript should add a direct correlation analysis (e.g., between per-sample quality/alignment scores and either per-sample loss reduction or influence on final benchmark scores) in the experiments section to demonstrate that the selection criterion is predictive rather than merely correlated with generic data quality (a minimal sketch of such a check follows the minor comments below).
  2. [§4 (Results) and associated tables/figures] To isolate the benefit of the joint quality-alignment distribution and weighted sampling from simple data reduction, the paper must report performance of random subsampling at the exact same cardinality as the DOSE subset, together with multiple random seeds and error bars. Without these controls, it is impossible to rule out that any observed parity or improvement is an artifact of reduced dataset size or variance in training.
minor comments (2)
  1. [Abstract] The abstract states that 'extensive experiments demonstrate effectiveness' yet contains no numerical results, baseline names, or effect sizes; adding at least one key performance delta (with dataset size) would improve readability.
  2. [Method section] Clarify the precise functional form of the adaptive weighted sampling (including any temperature or normalization constants) and provide the explicit formula for the joint quality-alignment density.
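
To make the calibration check requested in major comment 1 concrete, here is a minimal sketch, not taken from the paper: it assumes the per-sample joint scores and a per-sample utility proxy (for example, the reduction in held-out loss attributable to each sample) have already been computed, and it simply reports Pearson and, as a robustness check, Spearman correlations between the two.

```python
# Hypothetical calibration check: correlate selection scores with a
# per-sample utility proxy. Both inputs are assumed precomputed elsewhere;
# nothing here is taken from the paper.
import numpy as np
from scipy import stats

def score_utility_correlation(joint_scores, utility_proxy):
    """Pearson and Spearman correlation between scores and downstream utility."""
    joint_scores = np.asarray(joint_scores, dtype=float)
    utility_proxy = np.asarray(utility_proxy, dtype=float)
    pearson_r, pearson_p = stats.pearsonr(joint_scores, utility_proxy)
    spearman_r, spearman_p = stats.spearmanr(joint_scores, utility_proxy)
    return {
        "pearson_r": pearson_r, "pearson_p": pearson_p,
        "spearman_r": spearman_r, "spearman_p": spearman_p,
    }
```

A strong positive correlation would support the load-bearing premise above; a near-zero one would suggest the scores track generic data quality rather than downstream VQA/math utility.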

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comments point-by-point below and will update the manuscript to incorporate the suggested improvements.

point-by-point responses
  1. Referee: The headline claim—that off-the-shelf zero-shot scores can be used for adaptive sampling to produce a subset whose training utility equals or exceeds the full set—rests on the untested premise that these scores are sufficiently calibrated to downstream VQA/math utility. The manuscript should add a direct correlation analysis (e.g., between per-sample quality/alignment scores and either per-sample loss reduction or influence on final benchmark scores) in the experiments section to demonstrate that the selection criterion is predictive rather than merely correlated with generic data quality.

    Authors: We agree that a direct correlation analysis would strengthen the evidence that our zero-shot scores are predictive of downstream utility rather than merely reflecting generic data quality. While the end-to-end results in the current manuscript demonstrate the effectiveness of DOSE subsets, we have not reported per-sample correlations. In the revised version, we will add this analysis in the experiments section, including Pearson correlations between the quality/alignment scores and per-sample loss reductions on a held-out set, plus feasible approximations to influence where computationally tractable. revision: yes

  2. Referee: To isolate the benefit of the joint quality-alignment distribution and weighted sampling from simple data reduction, the paper must report performance of random subsampling at the exact same cardinality as the DOSE subset, together with multiple random seeds and error bars. Without these controls, it is impossible to rule out that any observed parity or improvement is an artifact of reduced dataset size or variance in training.

    Authors: We concur that random subsampling controls at identical cardinality are necessary to isolate the contribution of the joint distribution and adaptive weighting. The revised manuscript will include these baselines in §4 and associated tables/figures, reporting results averaged over multiple random seeds (e.g., 5) with standard error bars to quantify training variance and confirm that observed performance is not an artifact of dataset size (a minimal sketch of such a control follows these responses). revision: yes
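
As a concrete rendering of the control promised in response 2, here is a minimal sketch under stated assumptions: `train_and_eval` is a hypothetical placeholder for the full train-then-benchmark run, and `subset_size` is set to the DOSE subset's cardinality. It illustrates only the seed-averaged reporting, not the paper's training pipeline.

```python
# Hypothetical matched-cardinality random baseline: random subsets of the
# same size as the DOSE subset, repeated over several seeds, reported as
# mean +/- standard error. `train_and_eval` is a placeholder, not paper code.
import numpy as np

def random_subset_baseline(train_and_eval, n_total, subset_size,
                           seeds=(0, 1, 2, 3, 4)):
    """Mean and standard error of benchmark scores for random subsets."""
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        idx = rng.choice(n_total, size=subset_size, replace=False)
        scores.append(train_and_eval(idx, seed=seed))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
```

Parity between DOSE and this baseline at equal cardinality would point to data reduction alone; a consistent gap in DOSE's favor would isolate the contribution of the joint distribution and weighted sampling.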

Circularity Check

0 steps flagged

No circularity: purely empirical selection via external off-the-shelf models

full rationale

The paper describes DOSE as an empirical data-filtering pipeline that applies zero-shot quality and alignment scoring from pretrained models never exposed to the target dataset, followed by adaptive weighted sampling. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear in the derivation. Claims of matching or surpassing full-dataset performance rest on downstream benchmark experiments rather than reducing to the input scores by construction. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms or invented entities are identifiable from the abstract alone; the method relies on standard concepts of data quality and alignment assessment.

pith-pipeline@v0.9.0 · 10055 in / 1126 out tokens · 58913 ms · 2026-05-10T06:14:33.329001+00:00 · methodology

