The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 04:29 UTC · model grok-4.3
The pith
A carefully filtered 15-trillion-token dataset derived from Common Crawl produces better-performing LLMs than other open pretraining collections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FineWeb is a 15-trillion-token dataset derived from Common Crawl that, when used for pretraining, yields large language models that perform better than those trained on existing open datasets. FineWeb-Edu is a 1.3-trillion-token educational subset that leads to dramatically improved performance on benchmarks such as MMLU and ARC. The design choices in curation, including deduplication and filtering, are documented and ablated to understand their contributions.
What carries the argument
The curation pipeline: deduplication and filtering strategies applied across 96 Common Crawl snapshots to produce high-quality text data.
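To make the shape of such a pipeline concrete, here is a minimal sketch of a heuristic quality-filtering stage in Python. The thresholds and checks are illustrative stand-ins, not the filters the paper actually documents and ablates.

```python
# Minimal sketch of a per-document quality-filtering stage; the specific
# heuristics and thresholds below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    text: str

def passes_heuristics(doc: Doc,
                      min_words: int = 50,
                      max_mean_word_len: float = 10.0,
                      max_symbol_ratio: float = 0.1) -> bool:
    """Return True if the document survives simple quality heuristics."""
    words = doc.text.split()
    if len(words) < min_words:
        return False  # too short to be useful pretraining text
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len > max_mean_word_len:
        return False  # unusually long "words" suggest markup or gibberish
    symbols = sum(doc.text.count(c) for c in "#{}|<>")
    if symbols / max(len(doc.text), 1) > max_symbol_ratio:
        return False  # symbol-heavy page, likely boilerplate
    return True

docs = [Doc("a.example", "plain readable sentence " * 60),
        Doc("b.example", "{{##}} " * 40)]
kept = [d for d in docs if passes_heuristics(d)]
print(f"kept {len(kept)} of {len(docs)} documents")  # kept 1 of 2
```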
If this is right
- LLMs pretrained on FineWeb show better overall performance than those trained on other open datasets.
- FineWeb-Edu leads to substantial gains on knowledge and reasoning benchmarks like MMLU and ARC.
- Detailed ablations highlight the effects of specific filtering and deduplication techniques (a toy deduplication sketch follows this list).
- Public release of code and models enables community replication and extension of the curation methods.
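As a companion to the ablation bullet above, the toy sketch below shows the flavour of MinHash-style fuzzy near-duplicate detection that such deduplication ablations target. The 5-word shingles and 64 hash seeds are illustrative choices, not the paper's configuration.

```python
# Toy MinHash near-duplicate detection: hash word shingles under several
# seeds and keep each seed's minimum; matching minima estimate Jaccard overlap.
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    sh = shingles(text)
    # One "permutation" per seed: remember the smallest hash over all shingles.
    return [min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
            for seed in range(num_perm)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

base = " ".join(f"token{i}" for i in range(200))
near_dup = base.replace("token100", "tokenX")  # a single-word edit
sim = estimated_jaccard(minhash_signature(base), minhash_signature(near_dup))
print(f"estimated overlap: {sim:.2f}")  # close to 1.0; a dedup pass would drop one copy
```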
Where Pith is reading between the lines
- Similar curation techniques could be applied to improve datasets in other languages or domains beyond English web text.
- The success of educational filtering suggests that domain-specific subsets may be key for targeted model capabilities (a minimal scoring sketch follows this list).
- Releasing the full pipeline encourages standardized evaluation of data quality in LLM pretraining.
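The domain-subset idea from the list above can be illustrated with a toy scoring pass: rate each document's educational value and keep only high scorers. The `score_educational` heuristic below is a crude stand-in for a trained classifier, not the paper's method, and the 0-5 scale and cut-off are assumptions for this sketch.

```python
# Toy illustration of domain subsetting: keep only documents whose
# "educational value" score clears a threshold.
def score_educational(text: str) -> float:
    """Crude proxy: fraction of sentences containing instructional cue words."""
    cues = ("explain", "example", "definition", "theorem", "because")
    sentences = [s for s in text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(any(c in s.lower() for c in cues) for s in sentences)
    return 5.0 * hits / len(sentences)  # map the cue rate onto a 0-5 scale

THRESHOLD = 3.0  # illustrative cut-off for the "edu" subset
corpus = {
    "lecture": "We explain the theorem with an example because it clarifies the definition.",
    "spam": "Buy now. Click here. Limited offer.",
}
edu_subset = {name: text for name, text in corpus.items()
              if score_educational(text) >= THRESHOLD}
print(sorted(edu_subset))  # ['lecture']
```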
Load-bearing premise
The observed improvements in model performance are primarily attributable to the quality of the curated data rather than differences in model architecture, training duration, or optimization settings.
What would settle it
Re-training the same models with identical hyperparameters and for the same number of tokens on FineWeb versus a competing open dataset like The Pile, then comparing the resulting benchmark scores; equal performance would falsify the claim that FineWeb is superior due to curation.
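A hedged skeleton of how that settling experiment could be scored once identical runs exist on both corpora; the benchmark names come from the paper, while the zeroed scores are placeholders to be filled in from a real evaluation harness.

```python
# Skeleton for scoring the settling experiment: identical runs on FineWeb and
# a baseline corpus, evaluated on the same benchmarks. The zeroed scores are
# placeholders, not results from the paper.
BENCHMARKS = ["MMLU", "ARC"]
scores = {
    "FineWeb":  {"MMLU": 0.0, "ARC": 0.0},
    "The Pile": {"MMLU": 0.0, "ARC": 0.0},
}

def curation_claim_holds(results: dict[str, dict[str, float]], margin: float = 0.0) -> bool:
    # The superiority claim survives only if FineWeb beats the baseline on
    # every benchmark by more than the chosen margin.
    return all(results["FineWeb"][b] > results["The Pile"][b] + margin
               for b in BENCHMARKS)

print(curation_claim_holds(scores))  # False while the placeholder scores are tied
```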
Original abstract
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots, and FineWeb-Edu, a 1.3-trillion-token educational subset filtered from it. It claims that LLMs pretrained on FineWeb outperform those trained on prior open corpora (e.g., C4, The Pile) on standard benchmarks, with FineWeb-Edu yielding particularly large gains on knowledge- and reasoning-intensive tasks such as MMLU and ARC. The work documents and ablates all curation choices (deduplication, filtering), releases the full datasets, the curation codebase, and all ablation-trained models.
Significance. If the reported gains are attributable to the curation pipeline rather than uncontrolled differences in training configuration, the contribution would be substantial: it supplies both a new large-scale open pretraining resource and concrete, reproducible evidence on the impact of specific filtering and deduplication decisions. The public release of the complete codebase, all ablation models, and the datasets themselves is a clear strength that enables direct verification and extension by the community.
Major comments (1)
- The central performance claims (Abstract; cross-corpus comparisons) rest on the assumption that training setups for FineWeb/FineWeb-Edu runs are identical to those used for the C4, The Pile, and other baseline corpora in terms of model architecture, parameter count, total tokens processed, batch size, optimizer, and learning-rate schedule. The paper provides internal ablations of its own filtering/deduplication choices and releases those models, but does not explicitly document or verify equivalence for the external baseline comparisons; any mismatch would prevent isolating the effect of the curation pipeline.
Minor comments (1)
- The abstract and results sections would benefit from a concise table summarizing the exact training hyperparameters (tokens, model size, optimizer settings) used for each compared dataset to make the equivalence claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The concern about ensuring and documenting equivalence in training setups for cross-corpus comparisons is well-taken, and we address it directly below. We will revise the manuscript to make this aspect fully explicit.
Point-by-point responses
-
Referee: The central performance claims (Abstract; cross-corpus comparisons) rest on the assumption that training setups for FineWeb/FineWeb-Edu runs are identical to those used for the C4, The Pile, and other baseline corpora in terms of model architecture, parameter count, total tokens processed, batch size, optimizer, and learning-rate schedule. The paper provides internal ablations of its own filtering/deduplication choices and releases those models, but does not explicitly document or verify equivalence for the external baseline comparisons; any mismatch would prevent isolating the effect of the curation pipeline.
Authors: We agree that explicit documentation is essential for isolating the effect of the curation pipeline. All models reported in the cross-corpus comparisons (including those trained on C4, The Pile, and the other baselines) were pretrained from scratch using an identical setup: the same 1.3B-parameter decoder-only transformer architecture, the same total token count, batch size, AdamW optimizer, and learning-rate schedule (with the same number of training steps). This controlled setup is described in Section 4 and Appendix C, and the full training code plus all resulting models (including the baseline runs) have been released to enable direct verification. We acknowledge that the manuscript does not state this equivalence as clearly or prominently as it should for the external baselines. In the revised version we will add an explicit paragraph in the Experiments section, together with a summary table of shared hyperparameters, to remove any ambiguity.
Revision: yes
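A minimal sketch of the controlled setup the rebuttal describes, in which one frozen training recipe is reused and only the dataset varies. The 1.3B decoder-only model and AdamW optimizer come from the rebuttal; every numeric value below is a placeholder, not a hyperparameter from the paper.

```python
# Sketch of one frozen recipe reused across datasets so that only the data
# varies. Model size and optimizer follow the rebuttal; the numbers are
# placeholders, not the paper's hyperparameters.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    model: str = "decoder-only-transformer-1.3B"
    total_tokens: int = 350_000_000_000   # placeholder token budget
    batch_size_tokens: int = 2_097_152    # placeholder
    optimizer: str = "AdamW"
    peak_lr: float = 3e-4                 # placeholder
    lr_schedule: str = "cosine"           # placeholder

SHARED = TrainConfig()  # identical for FineWeb, FineWeb-Edu, C4, The Pile, ...

def launch_run(dataset: str, cfg: TrainConfig = SHARED) -> str:
    # Only the dataset name changes between runs; everything else is pinned.
    return f"train {cfg.model} on {dataset} for {cfg.total_tokens:,} tokens with {cfg.optimizer}"

for ds in ["FineWeb", "FineWeb-Edu", "C4", "The Pile"]:
    print(launch_run(ds))
```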
Circularity Check
No significant circularity: the claims rest on empirical dataset curation and model-training results.
Full rationale
The paper's core claims rest on constructing FineWeb from Common Crawl snapshots via documented filtering/deduplication steps, then empirically training and evaluating LLMs on it (and on FineWeb-Edu) against baselines like C4 and The Pile. Performance deltas on MMLU/ARC are presented as measured outcomes from those runs, with ablations and released models allowing independent verification. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the results are externally falsifiable via the public data and code.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 31 Pith papers
-
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
-
Layer Collapse in Diffusion Language Models
Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
-
Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
-
Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
Fixed 16-bit binary token codes can replace trainable input embeddings in 32-layer decoder-only models while maintaining comparable held-out perplexity on 17B tokens.
-
Dimension-Free Saddle-Point Escape in Muon
Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Finding Belief Geometries with Sparse Autoencoders
A new pipeline identifies candidate simplex geometries in Gemma-2-9B representations, with five clusters showing significant barycentric prediction advantages consistent with belief-state encoding.
-
Metriplector: From Field Theory to Neural Architecture
Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
GLiNER-Relex unifies NER and RE in one zero-shot transformer-based model that achieves competitive results on CoNLL04, DocRED, FewRel, and CrossRE.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
-
Language corpora for the Dutch medical domain
A 35-billion-token Dutch medical corpus was assembled from translated, mined, and extracted sources and released publicly on Hugging Face as the first large-scale resource of its kind.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.