arxiv: 2309.16671 · v6 · submitted 2023-09-28 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links

Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer


Pith reviewed 2026-05-16 09:16 UTC · model grok-4.3


keywords: MetaCLIP, CLIP, data curation, image-text pairs, zero-shot classification, CommonCrawl, contrastive pre-training, vision-language models


The pith

MetaCLIP balances CommonCrawl image-text pairs using CLIP-derived metadata to exceed original CLIP performance on zero-shot benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that CLIP's effectiveness comes mainly from its data curation process rather than model architecture or loss. It introduces MetaCLIP, which extracts metadata from CLIP concepts and applies explicit balancing over that distribution to select a high-quality subset from a raw pool such as CommonCrawl. With 400 million pairs the resulting dataset reaches 70.8 percent zero-shot ImageNet accuracy on ViT-B models, compared with 68.3 percent for the original CLIP data. Scaling the curated set to one billion pairs lifts accuracy to 72.4 percent under the same training budget, and the gains persist across model scales including ViT-H at 80.5 percent. The work releases the curation code and metadata distribution to make the process reproducible.

Core claim

MetaCLIP takes a raw data pool and metadata derived from CLIP's concepts and yields a balanced subset over the metadata distribution. When applied to CommonCrawl, the 400-million-pair version outperforms the original CLIP data on multiple standard benchmarks; zero-shot ImageNet accuracy rises from 68.3 percent to 70.8 percent on ViT-B, and scaling to one billion pairs reaches 72.4 percent while holding training compute fixed.

What carries the argument

Metadata-Curated Language-Image Pre-training (MetaCLIP), a curation procedure that extracts concepts from CLIP and explicitly balances image-text pairs over the resulting metadata distribution.
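
The balancing step can be illustrated with a small sketch. It assumes that captions are matched to metadata entries by substring search and that each entry is limited by a per-entry cap; neither detail is spelled out on this page, and the names `match_entries`, `balance`, and `cap` are hypothetical. It is a toy illustration of balancing over a metadata distribution, not the released curation code.

```python
import random
from collections import Counter


def match_entries(caption, metadata):
    """Return the metadata entries that occur as substrings of the caption."""
    text = caption.lower()
    return [entry for entry in metadata if entry in text]


def balance(captions, metadata, cap, seed=0):
    """Downsample captions whose matched entries are over-represented.

    `cap` is a hypothetical per-entry limit (an assumption of this sketch):
    entries matched at most `cap` times keep all of their captions, while
    more frequent entries keep each caption with probability cap / count.
    """
    rng = random.Random(seed)
    matches = [match_entries(c, metadata) for c in captions]
    counts = Counter(entry for matched in matches for entry in matched)
    kept = []
    for caption, matched in zip(captions, matches):
        # Captions that match no entry are dropped; a caption survives
        # if at least one of its matched entries selects it.
        if any(rng.random() < cap / max(counts[e], cap) for e in matched):
            kept.append(caption)
    return kept
```

With a cap of 10, a pool containing 1,000 captions that mention "dog" and one that mentions "cat" keeps the single "cat" caption with certainty and roughly 1 percent of the "dog" captions, which flattens the head of the distribution while preserving the tail.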

If this is right

Zero-shot ImageNet accuracy improves from 68.3 percent to 70.8 percent on ViT-B without any change to model size or training schedule.
Scaling the balanced dataset to one billion pairs produces an additional lift to 72.4 percent under identical compute.
The same curation yields consistent gains across model capacities, reaching 80.5 percent with ViT-H.
Releasing the metadata distribution and curation code allows any raw pool to be processed in the same way.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data balancing over a fixed concept vocabulary may generalize to other contrastive vision-language training regimes beyond CLIP.
If metadata balancing is the dominant factor, then further gains could be obtained by refining the concept list rather than simply increasing raw data volume.
The approach provides a concrete path toward more transparent and auditable large-scale vision-language datasets that do not rely on undisclosed filtering steps.

Load-bearing premise

That metadata derived from CLIP concepts captures the essential distributional properties responsible for the original CLIP data's effectiveness, and that balancing over this metadata is the main cause of the observed gains.

What would settle it

Run the released MetaCLIP curation code on CommonCrawl to produce the 400-million-pair dataset, train a ViT-B model from scratch under the paper's exact settings, and check whether zero-shot ImageNet accuracy reaches or exceeds 70.8 percent.
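
The final check in that procedure is a top-1 accuracy computation over embeddings. A minimal sketch follows, assuming the trained model has already produced one embedding per image and one text embedding per class; the embedding step, the prompt templates, and the function names are illustrative and are not taken from the paper's code. Embeddings are assumed to be non-zero vectors.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_top1(image_embs, class_embs, labels):
    """Fraction of images whose most similar class embedding is the true label."""
    class_unit = [normalize(c) for c in class_embs]
    correct = 0
    for emb, label in zip(image_embs, labels):
        unit = normalize(emb)
        # Cosine similarity between the image and every class embedding.
        scores = [sum(a * b for a, b in zip(unit, c)) for c in class_unit]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == label
    return correct / len(labels)
```

For example, with class embeddings `[[1, 0], [0, 1]]`, image embeddings `[[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]`, and labels `[0, 1, 1]`, the third image is assigned to class 0, so the function returns 2/3.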

Original abstract

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Metadata-Curated Language-Image Pre-training (MetaCLIP), which starts from a raw CommonCrawl pool and applies metadata derived from CLIP concepts to produce a balanced 400M (or 1B) image-text subset. Holding model architecture, optimizer, and training budget fixed, the authors report that MetaCLIP data yields higher zero-shot performance than CLIP's original data, e.g., 70.8% vs. 68.3% ImageNet top-1 accuracy on ViT-B and 72.4% at 1B scale, with similar gains across ViT sizes and other benchmarks. Curation code is released.

Significance. If the gains are attributable to the explicit balancing step, the work supplies a reproducible, open recipe for curating CLIP-scale data and clarifies a key (but previously opaque) ingredient behind CLIP's success. The public release of the curation pipeline is a concrete strength that lowers barriers for follow-on research.

major comments (2)

[Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.
[§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before/after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.

minor comments (2)

[Abstract and §1] The assertion that data is 'the main ingredient' to CLIP's success is stated categorically; a more measured phrasing ('a primary ingredient') would better reflect the controlled but still partial nature of the experiments.
[Results tables] Report standard deviations or run-to-run variability for all accuracy numbers to support the claimed improvements.
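
The variability request in the last comment amounts to a small computation per results row. The sketch below assumes each configuration has been trained with several seeds; the helper names are hypothetical, and the numbers in the usage note are made-up placeholders, not results from the paper.

```python
import statistics


def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def gap_in_std_units(baseline, candidate):
    """Difference of means divided by the larger sample standard deviation.

    Assumes at least two runs per configuration and non-identical runs,
    so that the standard deviation is defined and non-zero.
    """
    base_mean, base_std = summarize(baseline)
    cand_mean, cand_std = summarize(candidate)
    return (cand_mean - base_mean) / max(base_std, cand_std)
```

With placeholder accuracies `[10.0, 11.0, 12.0]` and `[14.0, 15.0, 16.0]`, both configurations have a sample standard deviation of 1.0 and the means differ by 4.0, so the gap is four standard deviations. A reported delta is easier to trust when it is large relative to this spread.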

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve experimental rigor and methodological clarity.

Point-by-point responses

Referee: [Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.

Authors: We agree that an explicit control isolating the balancing step would strengthen the central claim. In the revised manuscript we have added results comparing MetaCLIP to (i) a random 400M draw from the identical CommonCrawl pool and (ii) an unbalanced subset drawn using the same metadata vocabulary but without the balancing step. These controls confirm that balancing, rather than pool properties or other filtering, accounts for the observed gains. The Experiments section has been updated accordingly.
Revision: yes

Referee: [§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before/after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.

Authors: We have expanded §3 with the precise definitions: the metadata category set consists of the 400K CLIP concepts, the target distribution is uniform over these categories, and we now detail the full sequence of pre- and post-balancing filters (including deduplication, quality thresholds, and language-image alignment criteria). In addition, the publicly released curation code at https://github.com/facebookresearch/MetaCLIP implements the exact pipeline with all parameters, enabling full reproduction.
Revision: yes

Circularity Check

0 steps flagged

Empirical data curation from external pool shows no circular reduction

Full rationale

The paper describes an empirical curation procedure (MetaCLIP) that ingests an external raw pool (CommonCrawl) and a fixed metadata vocabulary derived from the original CLIP paper's concepts, then applies explicit balancing to produce a training subset. Models are trained from scratch under controlled settings and evaluated on standard benchmarks against CLIP's published numbers. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation chain, or redefinition of the input; the performance delta is measured via independent runs on held-out data. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach assumes that CLIP's success is driven primarily by data distribution properties that can be recovered via metadata balancing, and that the chosen metadata categories derived from CLIP concepts are representative without introducing new biases.

free parameters (1)

metadata category set
The set of concepts used to derive metadata is taken from CLIP's training vocabulary; no explicit fitting value is stated but the choice directly controls the balancing target.

axioms (1)

Domain assumption: Balancing image-text pairs over a metadata distribution derived from CLIP concepts produces a higher-quality training set than the original CLIP curation.
Invoked in the abstract as the core hypothesis that the paper tests experimentally.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence hierarchy_emergence_forces_phi echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

MetaCLIP takes a raw data pool and metadata (derived from CLIP’s concepts) and yields a balanced subset over the metadata distribution. ... MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP’s data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP’s 68.3% on ViT-B models.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
cs.CV 2026-04 unverdicted novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
cs.CV 2026-05 unverdicted novelty 7.0

Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 unverdicted novelty 6.0

Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
cs.LG 2026-05 conditional novelty 6.0

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
cs.CV 2026-05 unverdicted novelty 6.0

Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
cs.CV 2026-04 conditional novelty 6.0

CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...

Vision Transformers Need More Than Registers
cs.CV 2026-02 unverdicted novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, tex...

Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
cs.CV 2026-02 conditional novelty 6.0

Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
cs.CV 2026-05 unverdicted novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

Let ViT Speak: Generative Language-Image Pre-training
cs.CV 2026-05 unverdicted novelty 5.0

GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
cs.CV 2026-04 unverdicted novelty 5.0

VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
cs.LG 2026-04 unverdicted novelty 5.0

Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

Human-Inspired Context-Selective Multimodal Memory for Social Robots
cs.AI 2026-04 unverdicted novelty 5.0

A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.

Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

Reference graph

Works this paper leans on

179 extracted references · 179 canonical work pages · cited by 19 Pith papers · 32 internal anchors

[2]

Coresets for nonparametric estimation-the case of dp-means

Olivier Bachem, Mario Lucic, and Andreas Krause. Coresets for nonparametric estimation-the case of dp-means. In International Conference on Machine Learning, pp.\ 209--217. PMLR, 2015

work page 2015
[4]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020

work page 2020
[5]

Scalable training of mixture models via coresets

Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. Advances in neural information processing systems, 24, 2011

work page 2011
[6]

Datacomp: In search of the next generation of multimodal datasets, 2023

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

work page 2023
[7]

On coresets for k-means and k-median clustering

Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp.\ 291--300, 2004

work page 2004
[8]

Two-phase clustering process for outliers detection

Mon-Fong Jiang, Shian-Shyong Tseng, and Chih-Ming Su. Two-phase clustering process for outliers detection. Pattern recognition letters, 22 0 (6-7): 0 691--700, 2001

work page 2001
[9]

Coresets for data-efficient training of machine learning models

Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp.\ 6950--6960. PMLR, 2020

work page 2020
[10]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[12]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Cade W Gordon, Ross Wightman, Theo Coombes, Aarush Katta, Clayton Mullis, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. 2022

work page 2022
[13]

Beyond neural scaling laws: beating power law scaling via data pruning

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35: 0 19523--19536, 2022

work page 2022
[16]

Findout: Finding outliers in very large datasets

Dantong Yu, Gholamhosein Sheikholeslami, and Aidong Zhang. Findout: Finding outliers in very large datasets. Knowledge and information Systems, 4: 0 387--412, 2002

work page 2002
[17]

arXiv preprint arXiv:2306.16527 , year=

OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents , author=. arXiv preprint arXiv:2306.16527 , year=

work page arXiv
[18]

arXiv preprint arXiv:2304.06939 , year=

Multimodal c4: An open, billion-scale corpus of images interleaved with text , author=. arXiv preprint arXiv:2304.06939 , year=

work page arXiv
[19]

arXiv preprint arXiv:2308.12284 , year=

D4: Improving LLM Pretraining via Document De-Duplication and Diversification , author=. arXiv preprint arXiv:2308.12284 , year=

work page arXiv
[20]

International conference on machine learning , pages=

Scaling up visual and vision-language representation learning with noisy text supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[21]

2023 , eprint=

Improving Multimodal Datasets with Image Captioning , author=. 2023 , eprint=

work page 2023
[22]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[23]

arXiv preprint arXiv:2303.09540 , year=

SemDeDup: Data-efficient learning at web-scale through semantic deduplication , author=. arXiv preprint arXiv:2303.09540 , year=

work page arXiv
[24]

arXiv preprint arXiv:1812.05159 , year=

An empirical study of example forgetting during deep neural network learning , author=. arXiv preprint arXiv:1812.05159 , year=

work page arXiv
[25]

International Conference on Machine Learning , pages=

Coresets for data-efficient training of machine learning models , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[26]

Advances in Neural Information Processing Systems , volume=

Beyond neural scaling laws: beating power law scaling via data pruning , author=. Advances in Neural Information Processing Systems , volume=

work page
[27]

International Conference on Machine Learning , pages=

Coresets for nonparametric estimation-the case of DP-means , author=. International Conference on Machine Learning , pages=. 2015 , organization=

work page 2015
[28]

Advances in neural information processing systems , volume=

Scalable training of mixture models via coresets , author=. Advances in neural information processing systems , volume=

work page
[29]

Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=

On coresets for k-means and k-median clustering , author=. Proceedings of the thirty-sixth annual ACM symposium on Theory of computing , pages=

work page
[30]

Knowledge and information Systems , volume=

Findout: Finding outliers in very large datasets , author=. Knowledge and information Systems , volume=. 2002 , publisher=

work page 2002
[31]

Pattern recognition letters , volume=

Two-phase clustering process for outliers detection , author=. Pattern recognition letters , volume=. 2001 , publisher=

work page 2001
[32]

doi:10.5281/zenodo.5143773 , url =

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773
[33]

arXiv preprint arXiv:2301.02241 , year=

CiT: Curation in Training for Effective Vision-Language Data , author=. arXiv preprint arXiv:2301.02241 , year=

work page arXiv
[34]

2023 , eprint=

DataComp: In search of the next generation of multimodal datasets , author=. 2023 , eprint=

work page 2023
[35]

arXiv preprint arXiv:2212.07143 , year=

Reproducible scaling laws for contrastive language-image learning , author=. arXiv preprint arXiv:2212.07143 , year=

work page arXiv
[36]

Scaling language-image pre-training via masking

Scaling Language-Image Pre-training via Masking , author=. arXiv preprint arXiv:2212.00794 , year=

work page arXiv
[37]

Beyond neural scaling laws: beating power law scaling via data pruning , author=

work page
[38]

arXiv preprint arXiv:2207.07635 , year=

Is a caption worth a thousand images? a controlled study for representation learning , author=. arXiv preprint arXiv:2207.07635 , year=

work page arXiv
[39]

arXiv preprint arXiv:2010.00747 , year=

Contrastive learning of medical visual representations from paired images and text , author=. arXiv preprint arXiv:2010.00747 , year=

work page arXiv 2010
[40]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Revisiting Weakly Supervised Pre-Training of Visual Perception Models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[41]

International conference on machine learning , pages=

Submodularity in data subset selection and active learning , author=. International conference on machine learning , pages=. 2015 , organization=

work page 2015
[42]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Glister: Generalization based data subset selection for efficient and robust learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[43]

arXiv preprint arXiv:2104.07705 , year=

How to train bert with an academic budget , author=. arXiv preprint arXiv:2104.07705 , year=

work page arXiv
[44]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Unsupervised feature learning via non-parametric instance discrimination , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[45]

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Socratic models: Composing zero-shot multimodal reasoning with language , author=. arXiv preprint arXiv:2204.00598 , year=

work page internal anchor Pith review arXiv
[46]

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo: a visual language model for few-shot learning , author=. arXiv preprint arXiv:2204.14198 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

International Conference on Machine Learning , pages=

Nlp from scratch without large-scale pretraining: A simple and efficient framework , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[48]

arXiv preprint arXiv:2004.09733 , year=

Train no evil: Selective masking for task-guided pre-training , author=. arXiv preprint arXiv:2004.09733 , year=

work page arXiv 2004
[49]

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Deberta: Decoding-enhanced bert with disentangled attention , author=. arXiv preprint arXiv:2006.03654 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006
[50]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Pre-training Text Encoders as Discriminators Rather Than Generators , author=. Preprint at https://arxiv. org/abs/2003.10555 , year=

work page internal anchor Pith review arXiv 2003
[51]

Advances in Neural Information Processing Systems , volume=

Searching for Efficient Transformers for Language Modeling , author=. Advances in Neural Information Processing Systems , volume=

work page
[52]

arXiv preprint arXiv:2101.00063 , year=

Earlybert: Efficient bert training via early-bird lottery tickets , author=. arXiv preprint arXiv:2101.00063 , year=

work page arXiv
[53]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Megatron-lm: Training multi-billion parameter language models using model parallelism , author=. arXiv preprint arXiv:1909.08053 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1909
[54]

Reducing BERT pre-training time from 3 days to 76 minutes

Large batch optimization for deep learning: Training bert in 76 minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page arXiv 1904
[55]

BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis , author=. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages=

work page 2019
[56]

Bioinformatics , volume=

BioBERT: a pre-trained biomedical language representation model for biomedical text mining , author=. Bioinformatics , volume=. 2020 , publisher=

work page 2020
[57]

Task-specific objectives of pre-trained language models for dialogue adaptation. arXiv preprint arXiv:2009.04984.
[58]

Don't stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
[59]

Curriculum Learning for Domain Adaptation in Neural Machine Translation. arXiv preprint arXiv:1905.05816.
[60]

Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.
[61]

LAION-5B: An open large-scale dataset for training next generation image-text models.
[62]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv preprint arXiv:2111.02114.
[63]

Florence: A New Foundation Model for Computer Vision. arXiv preprint arXiv:2111.11432.
[64]

Pretrained transformers for text ranking: BERT and beyond. Proceedings of the 14th ACM International Conference on Web Search and Data Mining.
[65]

Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
[66]

Reading Wikipedia to Answer Open-Domain Questions. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[67]

The TREC-8 question answering track report. TREC.
[68]

CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv preprint arXiv:2205.01917.
[69]

SimCSE: Simple Contrastive Learning of Sentence Embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[70]

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[71]

Scaling up visual and vision-language representation learning with noisy text supervision. International Conference on Machine Learning, 2021.
[72]

YFCC100M: The new data in multimedia research. Communications of the ACM, 2016.
[73]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. doi:10.18653/v1/P18-1238.
[74]

Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[75]

Robust fine-tuning of zero-shot models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[76]

An empirical study of training self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[77]

Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[78]

LiT: Zero-shot transfer with locked-image text tuning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[79]

SLIP: Self-supervision meets language-image pre-training. arXiv preprint arXiv:2112.12750.
[80]

How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270.
[81]

FLAVA: A Foundational Language And Vision Alignment Model. arXiv preprint arXiv:2112.04482.
[82]

On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258.
[83]

Scaling vision transformers. arXiv preprint arXiv:2106.04560.
[84]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.
[85]

BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254.
