pith. machine review for the scientific record.

arxiv: 2309.16671 · v6 · submitted 2023-09-28 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links

Demystifying CLIP Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:16 UTC · model grok-4.3

classification: 💻 cs.CV · cs.CL
keywords: MetaCLIP, CLIP, data curation, image-text pairs, zero-shot classification, CommonCrawl, contrastive pre-training, vision-language models

The pith

MetaCLIP balances CommonCrawl image-text pairs using CLIP-derived metadata to exceed original CLIP performance on zero-shot benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that CLIP's effectiveness comes mainly from its data curation process rather than its model architecture or loss. It introduces MetaCLIP, which derives metadata from CLIP's concepts and explicitly balances image-text pairs over that metadata distribution to select a high-quality subset from a raw pool such as CommonCrawl. With 400 million pairs, the resulting dataset reaches 70.8 percent zero-shot ImageNet accuracy on ViT-B models, compared with 68.3 percent for the original CLIP data. Scaling the curated set to one billion pairs lifts accuracy to 72.4 percent under the same training budget, and the gains persist across model scales, including ViT-H at 80.5 percent. The work releases the curation code and metadata distribution to make the process reproducible.

Core claim

MetaCLIP takes a raw data pool and metadata derived from CLIP's concepts and yields a balanced subset over the metadata distribution. When applied to CommonCrawl, the 400-million-pair version outperforms the original CLIP data on multiple standard benchmarks; zero-shot ImageNet accuracy rises from 68.3 percent to 70.8 percent on ViT-B, and scaling to one billion pairs reaches 72.4 percent while holding training compute fixed.

What carries the argument

Metadata-Curated Language-Image Pre-training (MetaCLIP), a curation procedure that extracts concepts from CLIP and explicitly balances image-text pairs over the resulting metadata distribution.
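
The procedure is easiest to picture as two passes over the raw pool: match each caption against the concept vocabulary, then rebalance so over-represented concepts are capped while rare ones are kept in full. The Python sketch below is a minimal illustration of that idea under stated assumptions, not the released pipeline; the concept list, the substring-matching rule, and the per-concept budget max_per_concept are all placeholders.

```python
import random
from collections import defaultdict

def curate(pairs, concepts, max_per_concept=20, seed=0):
    """Illustrative two-pass curation: substring matching, then balancing.

    pairs           : list of (image_url, caption) tuples from a raw pool
    concepts        : metadata vocabulary (list of concept strings)
    max_per_concept : assumed per-concept budget; head concepts are
                      subsampled, tail concepts are kept whole
    """
    rng = random.Random(seed)

    # Pass 1: map each concept to the captions that mention it.
    matches = defaultdict(list)
    for idx, (_, caption) in enumerate(pairs):
        text = caption.lower()
        for concept in concepts:
            if concept in text:
                matches[concept].append(idx)

    # Pass 2: balance by capping each concept at the budget, flattening
    # the concept distribution without discarding the tail.
    kept = set()
    for concept, idxs in matches.items():
        if len(idxs) > max_per_concept:
            idxs = rng.sample(idxs, max_per_concept)
        kept.update(idxs)
    return [pairs[i] for i in sorted(kept)]

# Toy usage: the head concept "dog" is capped at 20, the tail concept
# "okapi" survives intact, and unmatched captions are dropped.
pool = [(f"img{i}", "a photo of a dog") for i in range(100)]
pool += [("img_x", "an okapi in the forest"), ("img_y", "asdf1234")]
print(len(curate(pool, ["dog", "okapi"])))  # 21
```

The balancing pass is the step the paper credits for the gains: without it, the subset would simply inherit the head-heavy concept distribution of the raw pool.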

If this is right

  • Zero-shot ImageNet accuracy improves from 68.3 percent to 70.8 percent on ViT-B without any change to model size or training schedule.
  • Scaling the balanced dataset to one billion pairs produces an additional lift to 72.4 percent under identical compute.
  • The same curation yields consistent gains across model capacities, reaching 80.5 percent with ViT-H.
  • Releasing the metadata distribution and curation code allows any raw pool to be processed in the same way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data balancing over a fixed concept vocabulary may generalize to other contrastive vision-language training regimes beyond CLIP.
  • If metadata balancing is the dominant factor, then further gains could be obtained by refining the concept list rather than simply increasing raw data volume.
  • The approach provides a concrete path toward more transparent and auditable large-scale vision-language datasets that do not rely on undisclosed filtering steps.

Load-bearing premise

That metadata derived from CLIP concepts captures the essential distributional properties responsible for the original CLIP data's effectiveness, and that balancing over this metadata is the main cause of the observed gains.

What would settle it

Run the released MetaCLIP curation code on CommonCrawl to produce the 400-million-pair dataset, train a ViT-B model from scratch under the paper's exact settings, and check whether zero-shot ImageNet accuracy reaches or exceeds 70.8 percent.
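
The accuracy check at the end of that recipe is a standard zero-shot evaluation: embed one prompt per ImageNet class, embed each validation image, and predict the class whose prompt is most similar. The sketch below assumes generic encode_image and encode_text callables exposed by the trained model; the single prompt template is a placeholder for CLIP's usual prompt ensemble.

```python
import numpy as np

def zero_shot_accuracy(encode_image, encode_text, images, labels, class_names,
                       template="a photo of a {}."):
    """Top-1 zero-shot accuracy for a CLIP-style dual encoder.

    encode_text(prompts) -> (C, D) array, one row per class prompt (assumed API)
    encode_image(images) -> (N, D) array, one row per image (assumed API)
    labels               -> length-N integer class indices
    """
    prompts = [template.format(name) for name in class_names]

    # L2-normalize both sides so the dot product is cosine similarity.
    txt = np.asarray(encode_text(prompts), dtype=np.float64)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    img = np.asarray(encode_image(images), dtype=np.float64)
    img /= np.linalg.norm(img, axis=1, keepdims=True)

    # Each image is assigned the class with the most similar prompt embedding.
    preds = (img @ txt.T).argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())
```

A faithful reproduction would run this over the full 50,000-image ImageNet validation set, average several prompt templates per class as the original CLIP evaluation does, and compare the resulting top-1 figure against the reported 70.8 percent.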

read the original abstract

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Metadata-Curated Language-Image Pre-training (MetaCLIP), which starts from a raw CommonCrawl pool and applies metadata derived from CLIP concepts to produce a balanced 400M (or 1B) image-text subset. Holding model architecture, optimizer, and training budget fixed, the authors report that MetaCLIP data yields higher zero-shot performance than CLIP's original data, e.g., 70.8% vs. 68.3% ImageNet top-1 accuracy on ViT-B and 72.4% at 1B scale, with similar gains across ViT sizes and other benchmarks. Curation code is released.

Significance. If the gains are attributable to the explicit balancing step, the work supplies a reproducible, open recipe for curating CLIP-scale data and clarifies a key (but previously opaque) ingredient behind CLIP's success. The public release of the curation pipeline is a concrete strength that lowers barriers for follow-on research.

major comments (2)
  1. [Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.
  2. [§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before and after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.
minor comments (2)
  1. [Abstract, §1] The assertion that data is 'the main ingredient' to CLIP's success is stated categorically; a more measured phrasing ('a primary ingredient') would better reflect the controlled but still partial nature of the experiments.
  2. [Results tables] Report standard deviations or run-to-run variability for all accuracy numbers to support the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve experimental rigor and methodological clarity.

read point-by-point responses
  1. Referee: [Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.

    Authors: We agree that an explicit control isolating the balancing step would strengthen the central claim. In the revised manuscript we have added results comparing MetaCLIP to (i) a random 400M draw from the identical CommonCrawl pool and (ii) an unbalanced subset drawn using the same metadata vocabulary but without the balancing step (both controls are sketched after the point-by-point responses below). These controls confirm that balancing, rather than pool properties or other filtering, accounts for the observed gains. The Experiments section has been updated accordingly. revision: yes

  2. Referee: [§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before and after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.

    Authors: We have expanded §3 with the precise definitions: the metadata category set consists of the 400K CLIP concepts, the target distribution is uniform over these categories, and we now detail the full sequence of pre- and post-balancing filters (including deduplication, quality thresholds, and language-image alignment criteria). In addition, the publicly released curation code at https://github.com/facebookresearch/MetaCLIP implements the exact pipeline with all parameters, enabling full reproduction. revision: yes
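
The two controls from the first exchange are cheap to express once a matching pass exists: a size-matched random draw from the identical pool, and a matched-but-unbalanced subset. The sketch below reuses the illustrative in-memory pairs/concepts representation from the earlier curation sketch and is likewise an assumption-laden toy, not the released code.

```python
import random

def random_control(pairs, subset_size, seed=0):
    """Control (i): a size-matched uniform draw from the identical raw pool."""
    rng = random.Random(seed)
    return rng.sample(pairs, subset_size)

def unbalanced_control(pairs, concepts):
    """Control (ii): keep every pair matching the concept vocabulary, with no
    per-concept cap, so the pool's head/tail imbalance is preserved."""
    return [
        (url, caption) for url, caption in pairs
        if any(concept in caption.lower() for concept in concepts)
    ]
```

Training on these two subsets at the same size and compute as the balanced subset is what would attribute the reported delta specifically to the balancing step.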

Circularity Check

0 steps flagged

Empirical data curation from external pool shows no circular reduction

full rationale

The paper describes an empirical curation procedure (MetaCLIP) that ingests an external raw pool (CommonCrawl) and a fixed metadata vocabulary derived from the original CLIP paper's concepts, then applies explicit balancing to produce a training subset. Models are trained from scratch under controlled settings and evaluated on standard benchmarks against CLIP's published numbers. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation chain, or redefinition of the input; the performance delta is measured via independent runs on held-out data. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes that CLIP's success is driven primarily by data distribution properties that can be recovered via metadata balancing, and that the chosen metadata categories derived from CLIP concepts are representative without introducing new biases.

free parameters (1)
  • metadata category set
    The set of concepts used to derive metadata is taken from CLIP's training vocabulary; no explicit fitting value is stated but the choice directly controls the balancing target.
axioms (1)
  • domain assumption: Balancing image-text pairs over a metadata distribution derived from CLIP concepts produces a higher-quality training set than the original CLIP curation.
    Invoked in the abstract as the core hypothesis that the paper tests experimentally.

pith-pipeline@v0.9.0 · 5606 in / 1367 out tokens · 20900 ms · 2026-05-16T09:16:41.611459+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence · hierarchy_emergence_forces_phi (tag: echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    MetaCLIP takes a raw data pool and metadata (derived from CLIP’s concepts) and yields a balanced subset over the metadata distribution. ... MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP’s data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP’s 68.3% on ViT-B models.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

    cs.CV 2026-05 unverdicted novelty 7.0

    Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...

  3. DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.

  4. Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

  5. When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

  6. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  7. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  8. Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

  9. LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

    cs.CV 2026-05 unverdicted novelty 6.0

    LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

  10. Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

    cs.CV 2026-04 conditional novelty 6.0

    CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

  11. Vision Transformers Need More Than Registers

    cs.CV 2026-02 unverdicted novelty 6.0

    ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, tex...

  12. Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

    cs.CV 2026-02 conditional novelty 6.0

    Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

  13. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  14. ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

    cs.CV 2026-05 unverdicted novelty 5.0

    ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

  15. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  16. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  17. Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

    cs.LG 2026-04 unverdicted novelty 5.0

    Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

  18. Human-Inspired Context-Selective Multimodal Memory for Social Robots

    cs.AI 2026-04 unverdicted novelty 5.0

    A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

  19. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
