pith. machine review for the scientific record.

arxiv: 2303.09540 · v3 · pith:O6FNK2VNnew · submitted 2023-03-16 · 💻 cs.LG · cs.AI · cs.CV

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Pith reviewed 2026-05-18 02:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CV
keywords semantic deduplication · data efficiency · web-scale datasets · LAION · embedding similarity · training speedup · out-of-distribution generalization

The pith

Removing semantic duplicates from web-scale datasets preserves model performance while halving training time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web-scale datasets contain many examples that are semantically similar even if not identical. SemDeDup uses embeddings from pre-trained models to locate and discard these semantic duplicates. On a LAION subset, this removes 50 percent of the data while causing only minimal drops in standard benchmark performance. Training time drops in proportion, and out-of-distribution accuracy rises. Applied to language models trained on C4, the method improves over prior deduplication approaches while also providing efficiency gains.

Core claim

SemDeDup identifies semantic duplicates by comparing embeddings from pre-trained models and removes redundant data points, showing that up to half the examples in a LAION subset can be discarded with little loss on in-distribution tasks and gains on out-of-distribution tasks.

What carries the argument

Embeddings from pre-trained models, used to measure semantic similarity between data points and to flag redundant ones for removal.
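
The mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the embeddings are already computed, compares each example against the kept set by brute force, and uses a hypothetical function name `semantic_dedup` with an arbitrary threshold of 0.9. At web scale, some form of approximate search or clustering would likely be needed to make the comparison tractable.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of examples kept after greedy semantic deduplication.

    Hypothetical helper: an example is dropped when its cosine similarity
    to any already kept example exceeds `threshold` (an assumed value).
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    for i in range(len(unit)):
        if kept:
            sims = unit[kept] @ unit[i]  # similarity to every kept example
            if sims.max() > threshold:
                continue  # near-duplicate of something already kept
        kept.append(i)
    return kept

# Toy data: rows 0 and 1 point in almost the same direction, row 2 is orthogonal.
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
kept_indices = semantic_dedup(toy, threshold=0.9)
print(kept_indices)  # [0, 2]
```

Keeping the first member of each near-duplicate group is a simplification chosen here for brevity; which member of a group to keep is a design decision the sketch does not try to settle.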

If this is right

  • Training can finish in roughly half the time on the reduced dataset.
  • Out-of-distribution performance can rise after semantic duplicates are removed.
  • The same deduplication step improves results over prior methods on partially curated text datasets such as C4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Iterative application of the method as embeddings improve during training could remove even more redundancy without extra cost.
  • The result suggests that current scaling gains depend more on semantic coverage than on raw example count.
  • Combining this embedding-based step with exact duplicate removal could produce still smaller yet equally effective datasets.

Load-bearing premise

Pre-trained embeddings reliably mark semantic duplicates without discarding unique information needed for the target task.

What would settle it

Train the same model on the full LAION subset versus the SemDeDup version and check whether accuracy on standard test sets falls by more than a few percent or out-of-distribution gains vanish.

Original abstract

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SemDeDup, a method that uses embeddings from pre-trained models to identify and remove semantic duplicates (semantically similar but not identical pairs) from large web-scale datasets. On a LAION subset, it claims that removing 50% of the data via this approach yields minimal in-distribution performance loss while halving training time and improving out-of-distribution performance. On the C4 dataset, it reports efficiency gains and improvements over prior deduplication methods for language model training. The work positions this as a simple way to leverage quality embeddings for faster learning with less data.

Significance. If the central empirical claims hold under tighter controls, the result would be significant for data-efficient training at scale: it offers a practical route to prune redundancy in uncurated web data without sacrificing performance, directly addressing compute costs for large models. The concrete 50% data reduction on LAION with preserved accuracy and OOD gains, plus the C4 comparison, provide actionable evidence of efficiency benefits. The approach also illustrates how off-the-shelf embeddings can be repurposed for dataset curation.

major comments (3)
  1. [§4] §4 (LAION experiments): the central claim that 50% data removal incurs 'minimal performance loss' is load-bearing for the paper's contribution, yet the manuscript provides no details on the exact semantic similarity threshold chosen, whether it was tuned on held-out data or the evaluation set, or results across a range of thresholds; without this, the result cannot be assessed for robustness or sensitivity to the free parameter.
  2. [§4] §4 (both LAION and C4 experiments): no statistical significance tests, error bars from multiple random seeds, or confirmation that data splits were fixed prior to any hyperparameter or threshold selection are reported; this directly affects confidence in the reported performance preservation, training speedups, and OOD improvements.
  3. [Method section and §4] Method section and §4: the assumption that cosine similarity (or equivalent) in a fixed pre-trained embedding space identifies pairs that contribute no unique gradient signal to the downstream loss is not tested via ablation on embedding model choice or qualitative inspection of retained vs. removed examples; this leaves the skeptic concern unaddressed and makes the deduplication rule's validity for the target task an open question.
minor comments (2)
  1. [Abstract] Abstract: the statement 'performance increases out of distribution' should specify the exact OOD datasets, tasks, and metrics to allow readers to evaluate the scope of the generalization claim.
  2. [Method] Throughout: the similarity measure and embedding model (e.g., CLIP or other) should be named explicitly with a reference in the method description rather than left implicit.
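
The variability reporting requested in major comment 2 could look like the following sketch. The accuracy values are placeholders invented for illustration and are not results from the paper. The helper name `summarize` and the noise check are likewise assumptions, and the check is a crude heuristic rather than a significance test.

```python
from statistics import mean, stdev

def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)

# Placeholder accuracies for three random seeds; NOT taken from the paper.
full_runs = [0.70, 0.72, 0.71]    # hypothetical: trained on the full subset
dedup_runs = [0.69, 0.71, 0.70]   # hypothetical: trained on the deduplicated subset

full_mean, full_std = summarize(full_runs)
dedup_mean, dedup_std = summarize(dedup_runs)
gap = full_mean - dedup_mean

# Crude heuristic (an assumption): treat the gap as within noise when it is
# no larger than the sum of the two standard deviations.
gap_within_noise = abs(gap) <= full_std + dedup_std

print(f"full:  {full_mean:.3f} +/- {full_std:.3f}")
print(f"dedup: {dedup_mean:.3f} +/- {dedup_std:.3f}")
print(f"gap:   {gap:.3f} (within noise: {gap_within_noise})")
```

A proper analysis would use more seeds and a formal test; the point of the sketch is only that a single-run comparison cannot separate a real gap from seed-to-seed variation.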

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on improving the robustness, statistical rigor, and validation of our empirical claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [§4] §4 (LAION experiments): the central claim that 50% data removal incurs 'minimal performance loss' is load-bearing for the paper's contribution, yet the manuscript provides no details on the exact semantic similarity threshold chosen, whether it was tuned on held-out data or the evaluation set, or results across a range of thresholds; without this, the result cannot be assessed for robustness or sensitivity to the free parameter.

    Authors: We agree that the threshold selection process requires explicit documentation to allow proper assessment of robustness. The threshold was selected solely to achieve an approximate 50% data reduction based on the distribution of cosine similarities computed over the training subset, and this decision was made prior to any access to held-out or evaluation data. In the revised manuscript we will report the precise threshold value, state that it was not tuned on the evaluation set, and add performance results across a range of thresholds to demonstrate sensitivity. revision: yes

  2. Referee: [§4] §4 (both LAION and C4 experiments): no statistical significance tests, error bars from multiple random seeds, or confirmation that data splits were fixed prior to any hyperparameter or threshold selection are reported; this directly affects confidence in the reported performance preservation, training speedups, and OOD improvements.

    Authors: We concur that the absence of variability estimates and explicit split-fixation statements limits confidence in the results. We will rerun the LAION and C4 experiments with multiple random seeds, report means with standard deviations, add error bars to all relevant plots, and include an explicit statement that data splits were fixed before any threshold selection or hyperparameter decisions. These updates will appear in the revised version. revision: yes

  3. Referee: [Method section and §4] Method section and §4: the assumption that cosine similarity (or equivalent) in a fixed pre-trained embedding space identifies pairs that contribute no unique gradient signal to the downstream loss is not tested via ablation on embedding model choice or qualitative inspection of retained vs. removed examples; this leaves the skeptic concern unaddressed and makes the deduplication rule's validity for the target task an open question.

    Authors: This concern about the validity of the deduplication criterion is well-taken. While the original submission focused on a single embedding model, we will add an ablation comparing SemDeDup performance when using alternative pre-trained embedding models. We will also include qualitative examples of retained versus removed pairs to illustrate the semantic content being filtered. These additions will be incorporated in the revised manuscript to better substantiate the underlying assumption. revision: yes
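
The threshold-selection procedure described in response 1 (pick the cutoff from the distribution of cosine similarities so that roughly half the data is removed) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the data is synthetic, the helper names `max_sim_to_earlier` and `threshold_for_fraction` are hypothetical, and the rule "drop an example if it is too similar to any earlier example" is a simplification chosen so that the removed fraction follows directly from a quantile.

```python
import numpy as np

def max_sim_to_earlier(embeddings: np.ndarray) -> np.ndarray:
    """For each example, the highest cosine similarity to any earlier example."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Keep only pairs (i, j) with j < i; other entries get the minimum cosine, -1.
    earlier = np.tril(np.ones_like(sims, dtype=bool), k=-1)
    return np.where(earlier, sims, -1.0).max(axis=1)

def threshold_for_fraction(scores: np.ndarray, remove_fraction: float) -> float:
    """Cutoff such that roughly `remove_fraction` of the scores lie above it."""
    return float(np.quantile(scores, 1.0 - remove_fraction))

# Synthetic embeddings (an assumption): 100 random points plus 100 noisy copies.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 16))
data = np.concatenate([base, base + 0.05 * rng.normal(size=base.shape)])

scores = max_sim_to_earlier(data)
threshold = threshold_for_fraction(scores, remove_fraction=0.5)
removed = scores > threshold
print(f"threshold={threshold:.3f}, removed fraction={removed.mean():.2f}")

# Sensitivity sweep: how the removed fraction changes with the cutoff.
for t in (0.5, 0.7, 0.9):
    print(f"t={t:.1f}: removed fraction={(scores > t).mean():.2f}")
```

Because the threshold here is derived only from the training-set similarity distribution, it involves no evaluation data, which mirrors the claim made in the response. The sweep at the end corresponds to the promised sensitivity analysis, although the real version would report downstream accuracy rather than just the removed fraction.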

Circularity Check

0 steps flagged

Empirical results from held-out training runs are independent of the deduplication procedure

Full rationale

The paper presents SemDeDup as a practical method that applies pre-trained embeddings to remove semantic duplicates, then validates its utility through direct training experiments on LAION subsets and C4. Reported outcomes—50% data removal with minimal performance loss, halved training time, and out-of-distribution gains—are measured by running actual model training on the deduplicated data rather than being algebraically derived from the similarity threshold or embedding function. No equations, fitted parameters, or self-citations reduce these performance numbers to tautological inputs; the evaluation remains external and falsifiable via the training runs themselves. The core assumption about embedding quality is acknowledged as an empirical premise but is not smuggled in as a mathematical necessity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the quality of off-the-shelf embeddings and a tunable similarity threshold; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • semantic similarity threshold
    A cutoff value must be chosen to decide when two embeddings count as duplicates; this is tuned to achieve the reported 50% removal rate.
axioms (1)
  • domain assumption: Pre-trained model embeddings capture semantic similarity sufficiently well for deduplication decisions
    The entire pipeline rests on this property of existing embeddings.

pith-pipeline@v0.9.0 · 5709 in / 1256 out tokens · 32315 ms · 2026-05-18T02:39:15.079648+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

  2. OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.

  3. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  4. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  5. Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

  6. Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.

  7. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

  8. Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems

    cs.LG 2026-04 unverdicted novelty 6.0

    MOSAIC is a scaling-aware data selection framework that outperforms baselines in training end-to-end autonomous driving planners, achieving comparable or better EPDMS scores with up to 80% less data.

  9. DataComp-LM: In search of the next generation of training sets for language models

    cs.LG 2024-06 unverdicted novelty 6.0

    DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

  10. Demystifying CLIP Data

    cs.CV 2023-09 accept novelty 6.0

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  11. LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

    cs.CR 2026-05 unverdicted novelty 5.0

    LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.

  12. Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment

    cs.SE 2026-04 unverdicted novelty 5.0

    DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.

  13. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  14. Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

    cs.CL 2026-05 unverdicted novelty 4.0

    Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.

  15. Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

    cs.CL 2026-05 unverdicted novelty 4.0

    Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.

  16. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

  17. Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

    cs.CL 2025-09 unverdicted novelty 3.0

    Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.

  18. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
