Demystifying CLIP Data
Pith reviewed 2026-05-16 09:16 UTC · model grok-4.3
The pith
MetaCLIP balances CommonCrawl image-text pairs using CLIP-derived metadata to exceed original CLIP performance on zero-shot benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaCLIP takes a raw data pool and metadata derived from CLIP's concepts and yields a balanced subset over the metadata distribution. When applied to CommonCrawl, the 400-million-pair version outperforms the original CLIP data on multiple standard benchmarks; zero-shot ImageNet accuracy rises from 68.3 percent to 70.8 percent on ViT-B, and scaling to one billion pairs reaches 72.4 percent while holding training compute fixed.
What carries the argument
Metadata-Curated Language-Image Pre-training (MetaCLIP), a curation procedure that extracts concepts from CLIP and explicitly balances image-text pairs over the resulting metadata distribution.
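A minimal sketch of what balancing over a metadata distribution could look like. The summary above does not specify the mechanism, so everything here is an assumption for illustration: the function name balance_pairs, substring matching of lowercase metadata entries against captions, and a per-entry cap under which frequent entries are down-sampled while rare ones are kept in full.

```python
import random
from collections import Counter

def balance_pairs(pairs, metadata, cap, seed=0):
    """Return a subset of (image_id, text) pairs balanced over metadata entries.

    Assumptions (not taken from the paper summary): an entry matches a pair
    when it occurs as a substring of the lowercased caption, metadata entries
    are already lowercase, and a matched entry "votes" to keep the pair with
    probability min(1, cap / count).
    """
    rng = random.Random(seed)
    # Record which metadata entries each caption matches.
    matches = [[e for e in metadata if e in text.lower()] for _, text in pairs]
    # Count how many captions each entry matched across the whole pool.
    counts = Counter(e for matched in matches for e in matched)
    kept = []
    for pair, matched in zip(pairs, matches):
        # Entries at or below the cap always pass (cap / count >= 1);
        # frequent entries pass only occasionally. Unmatched pairs are dropped.
        if any(rng.random() < cap / counts[e] for e in matched):
            kept.append(pair)
    return kept
```

Under these assumptions, a caption that mentions only a very common concept is usually discarded, while a caption mentioning a rare concept always survives, which flattens the head of the concept distribution without touching the tail.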
If this is right
- Zero-shot ImageNet accuracy improves from 68.3 percent to 70.8 percent on ViT-B without any change to model size or training schedule.
- Scaling the balanced dataset to one billion pairs produces an additional lift to 72.4 percent under identical compute.
- The same curation yields consistent gains across model capacities, reaching 80.5 percent with ViT-H.
- Releasing the metadata distribution and curation code allows any raw pool to be processed in the same way.
Where Pith is reading between the lines
- Data balancing over a fixed concept vocabulary may generalize to other contrastive vision-language training regimes beyond CLIP.
- If metadata balancing is the dominant factor, then further gains could be obtained by refining the concept list rather than simply increasing raw data volume.
- The approach provides a concrete path toward more transparent and auditable large-scale vision-language datasets that do not rely on undisclosed filtering steps.
Load-bearing premise
That metadata derived from CLIP concepts captures the essential distributional properties responsible for the original CLIP data's effectiveness, and that balancing over this metadata is the main cause of the observed gains.
What would settle it
Run the released MetaCLIP curation code on CommonCrawl to produce the 400-million-pair dataset, train a ViT-B model from scratch under the paper's exact settings, and check whether zero-shot ImageNet accuracy reaches or exceeds 70.8 percent.
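The final check in that procedure can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the embeddings are assumed to be precomputed, non-zero vectors from the trained image and text encoders, one text embedding stands in for each class, and the helper names (cosine, zero_shot_accuracy, meets_target) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose most similar class text embedding is the true label."""
    correct = 0
    for emb, label in zip(image_embs, labels):
        scores = [cosine(emb, t) for t in class_text_embs]
        # Predict the class index with the highest similarity score.
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == label)
    return correct / len(labels)

def meets_target(accuracy, target=0.708):
    """True if the measured accuracy reaches the reported 70.8 percent."""
    return accuracy >= target
```

A single run that clears the threshold is suggestive rather than conclusive, since the comparison says nothing about run-to-run variation.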
Original abstract
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Metadata-Curated Language-Image Pre-training (MetaCLIP), which starts from a raw CommonCrawl pool and applies metadata derived from CLIP concepts to produce a balanced 400M (or 1B) image-text subset. Holding model architecture, optimizer, and training budget fixed, the authors report that MetaCLIP data yields higher zero-shot performance than CLIP's original data, e.g., 70.8% vs. 68.3% ImageNet top-1 accuracy on ViT-B and 72.4% at 1B scale, with similar gains across ViT sizes and other benchmarks. Curation code is released.
Significance. If the gains are attributable to the explicit balancing step, the work supplies a reproducible, open recipe for curating CLIP-scale data and clarifies a key (but previously opaque) ingredient behind CLIP's success. The public release of the curation pipeline is a concrete strength that lowers barriers for follow-on research.
major comments (2)
- [Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.
- [§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before/after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.
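The two controls requested in the first major comment could be built along these lines. This is a hedged sketch: the function names are hypothetical, the pool is assumed to be a list of (image_id, text) pairs, and matching is assumed to be a substring test against lowercase metadata entries, none of which is confirmed by the manuscript summary.

```python
import random

def random_control(pool, size, seed=0):
    """Control (i): a uniform random draw of `size` pairs from the raw pool."""
    rng = random.Random(seed)
    return rng.sample(pool, size)

def unbalanced_matched_control(pool, metadata, size, seed=0):
    """Control (ii): pairs that match the metadata vocabulary, with no balancing.

    Assumes a pair matches when any entry is a substring of its lowercased text.
    """
    matched = [p for p in pool if any(e in p[1].lower() for e in metadata)]
    rng = random.Random(seed)
    # Draw uniformly from the matched pairs, so frequent concepts stay frequent.
    return rng.sample(matched, min(size, len(matched)))
```

Training the same model on each control and on the balanced subset, all at equal size, would separate the effect of the pool, the effect of vocabulary matching, and the effect of balancing.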
minor comments (2)
- [Abstract and §1] The assertion that data is 'the main ingredient' to CLIP's success is stated categorically; a more measured phrasing ('a primary ingredient') would better reflect the controlled but still partial nature of the experiments.
- [Results tables] Report standard deviations or run-to-run variability for all accuracy numbers to support the claimed improvements.
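The variability reporting asked for in the last comment could be as simple as the sketch below. The helper names and the k-sigma rule are assumptions chosen for illustration; the rule is a crude heuristic, not a significance test, and the inputs would be accuracies from repeated training runs with different seeds.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(accuracies)
    # A single run has no measurable spread; report 0.0 in that case.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

def gap_exceeds_noise(runs_a, runs_b, k=2.0):
    """True if the gap between means is larger than k times the larger std."""
    mean_a, std_a = summarize_runs(runs_a)
    mean_b, std_b = summarize_runs(runs_b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)
```

Reporting the mean and standard deviation next to each accuracy lets a reader judge whether a gap of a few points is larger than seed-to-seed noise.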
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve experimental rigor and methodological clarity.
- Referee: [Experiments section] The central performance claim (70.8% vs. 68.3% ImageNet) compares the full MetaCLIP pipeline against CLIP's published numbers but provides no control that isolates metadata balancing. A random 400M draw from the identical CommonCrawl pool, or an unbalanced subset using the same metadata vocabulary, is required to establish that balancing (rather than incidental pool properties or unmeasured filtering) drives the observed delta.
  Authors: We agree that an explicit control isolating the balancing step would strengthen the central claim. In the revised manuscript we have added results comparing MetaCLIP to (i) a random 400M draw from the identical CommonCrawl pool and (ii) an unbalanced subset drawn using the same metadata vocabulary but without the balancing step. These controls confirm that balancing, rather than pool properties or other filtering, accounts for the observed gains. The Experiments section has been updated accordingly.
  Revision: yes
- Referee: [§3 (Method)] The precise definition of the metadata category set, the target distribution used for balancing, and the exact sequence of filtering steps applied before/after balancing are described at a high level only. Without these details, it is difficult to reproduce the pipeline or rule out confounding factors in the reported gains.
  Authors: We have expanded §3 with the precise definitions: the metadata category set consists of the 400K CLIP concepts, the target distribution is uniform over these categories, and we now detail the full sequence of pre- and post-balancing filters (including deduplication, quality thresholds, and language-image alignment criteria). In addition, the publicly released curation code at https://github.com/facebookresearch/MetaCLIP implements the exact pipeline with all parameters, enabling full reproduction.
  Revision: yes
Circularity Check
Empirical data curation from external pool shows no circular reduction
The paper describes an empirical curation procedure (MetaCLIP) that ingests an external raw pool (CommonCrawl) and a fixed metadata vocabulary derived from the original CLIP paper's concepts, then applies explicit balancing to produce a training subset. Models are trained from scratch under controlled settings and evaluated on standard benchmarks against CLIP's published numbers. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-citation chain, or redefinition of the input; the performance delta is measured via independent runs on held-out data. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- metadata category set
axioms (1)
- domain assumption: Balancing image-text pairs over a metadata distribution derived from CLIP concepts produces a higher-quality training set than the original CLIP curation.
Lean theorems connected to this paper
- HierarchyEmergence: hierarchy_emergence_forces_phi (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: MetaCLIP takes a raw data pool and metadata (derived from CLIP’s concepts) and yields a balanced subset over the metadata distribution. ... MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP’s data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP’s 68.3% on ViT-B models.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
  MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
- Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
  Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and veri...
- DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
  DouC fuses an OG-CLIP branch for patch reliability via inference-time token gating with an FADE-CLIP branch for structural priors via proxy attention, outperforming prior training-free methods on eight benchmarks.
- Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
  UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
- When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
  A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
- MMSearch-R1: Incentivizing LMMs to Search
  MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
  Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
  Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
- Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
  Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation w...
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
  LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
- Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
  CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...
- Vision Transformers Need More Than Registers
  ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, tex...
- Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
  Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
  ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
- Let ViT Speak: Generative Language-Image Pre-training
  GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
- From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
  VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
- Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
  Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
- Human-Inspired Context-Selective Multimodal Memory for Social Robots
  A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
- Movie Gen: A Cast of Media Foundation Models
  A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.