The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 04:29 UTC · model grok-4.3
The pith
A carefully filtered 15-trillion-token dataset derived from Common Crawl produces better-performing LLMs than other open pretraining collections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FineWeb is a 15-trillion-token dataset derived from Common Crawl that, when used for pretraining, yields large language models that perform better than those trained on existing open datasets. FineWeb-Edu is a 1.3-trillion-token educational subset that leads to dramatically improved performance on benchmarks such as MMLU and ARC. The design choices in curation, including deduplication and filtering, are documented and ablated to understand their contributions.
What carries the argument
The curation pipeline: deduplication and filtering strategies applied across 96 Common Crawl snapshots to produce high-quality text data.
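To make the shape of such a pipeline concrete, here is a minimal sketch of a heuristic quality-filtering stage in Python. The thresholds and checks are illustrative stand-ins, not the filters the paper actually documents and ablates.

```python
# Minimal sketch of a per-document quality-filtering stage; the specific
# heuristics and thresholds below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Doc:
    url: str
    text: str

def passes_heuristics(doc: Doc,
                      min_words: int = 50,
                      max_mean_word_len: float = 10.0,
                      max_symbol_ratio: float = 0.1) -> bool:
    """Return True if the document survives simple quality heuristics."""
    words = doc.text.split()
    if len(words) < min_words:
        return False  # too short to be useful pretraining text
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len > max_mean_word_len:
        return False  # unusually long "words" suggest markup or gibberish
    symbols = sum(doc.text.count(c) for c in "#{}|<>")
    if symbols / max(len(doc.text), 1) > max_symbol_ratio:
        return False  # symbol-heavy page, likely boilerplate
    return True

docs = [Doc("a.example", "plain readable sentence " * 60),
        Doc("b.example", "{{##}} " * 40)]
kept = [d for d in docs if passes_heuristics(d)]
print(f"kept {len(kept)} of {len(docs)} documents")  # kept 1 of 2
```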
If this is right
- LLMs pretrained on FineWeb show better overall performance than those trained on other open datasets.
- FineWeb-Edu leads to substantial gains on knowledge and reasoning benchmarks like MMLU and ARC.
- Detailed ablations highlight the effects of specific filtering and deduplication techniques (a toy deduplication sketch follows this list).
- Public release of code and models enables community replication and extension of the curation methods.
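As a companion to the ablation bullet above, the toy sketch below shows the flavour of MinHash-style fuzzy near-duplicate detection that such deduplication ablations target. The 5-word shingles and 64 hash seeds are illustrative choices, not the paper's configuration.

```python
# Toy MinHash near-duplicate detection: hash word shingles under several
# seeds and keep each seed's minimum; matching minima estimate Jaccard overlap.
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    sh = shingles(text)
    # One "permutation" per seed: remember the smallest hash over all shingles.
    return [min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
            for seed in range(num_perm)]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

base = " ".join(f"token{i}" for i in range(200))
near_dup = base.replace("token100", "tokenX")  # a single-word edit
sim = estimated_jaccard(minhash_signature(base), minhash_signature(near_dup))
print(f"estimated overlap: {sim:.2f}")  # close to 1.0; a dedup pass would drop one copy
```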
Where Pith is reading between the lines
- Similar curation techniques could be applied to improve datasets in other languages or domains beyond English web text.
- The success of educational filtering suggests that domain-specific subsets may be key for targeted model capabilities (a minimal scoring sketch follows this list).
- Releasing the full pipeline encourages standardized evaluation of data quality in LLM pretraining.
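The domain-subset idea from the list above can be illustrated with a toy scoring pass: rate each document's educational value and keep only high scorers. The `score_educational` heuristic below is a crude stand-in for a trained classifier, not the paper's method, and the 0-5 scale and cut-off are assumptions for this sketch.

```python
# Toy illustration of domain subsetting: keep only documents whose
# "educational value" score clears a threshold.
def score_educational(text: str) -> float:
    """Crude proxy: fraction of sentences containing instructional cue words."""
    cues = ("explain", "example", "definition", "theorem", "because")
    sentences = [s for s in text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(any(c in s.lower() for c in cues) for s in sentences)
    return 5.0 * hits / len(sentences)  # map the cue rate onto a 0-5 scale

THRESHOLD = 3.0  # illustrative cut-off for the "edu" subset
corpus = {
    "lecture": "We explain the theorem with an example because it clarifies the definition.",
    "spam": "Buy now. Click here. Limited offer.",
}
edu_subset = {name: text for name, text in corpus.items()
              if score_educational(text) >= THRESHOLD}
print(sorted(edu_subset))  # ['lecture']
```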
Load-bearing premise
The observed improvements in model performance are primarily attributable to the quality of the curated data rather than differences in model architecture, training duration, or optimization settings.
What would settle it
Re-training the same models with identical hyperparameters and for the same number of tokens on FineWeb versus a competing open dataset like The Pile, then comparing the resulting benchmark scores; equal performance would falsify the claim that FineWeb is superior due to curation.
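A hedged skeleton of how that settling experiment could be scored once identical runs exist on both corpora; the benchmark names come from the paper, while the zeroed scores are placeholders to be filled in from a real evaluation harness.

```python
# Skeleton for scoring the settling experiment: identical runs on FineWeb and
# a baseline corpus, evaluated on the same benchmarks. The zeroed scores are
# placeholders, not results from the paper.
BENCHMARKS = ["MMLU", "ARC"]
scores = {
    "FineWeb":  {"MMLU": 0.0, "ARC": 0.0},
    "The Pile": {"MMLU": 0.0, "ARC": 0.0},
}

def curation_claim_holds(results: dict[str, dict[str, float]], margin: float = 0.0) -> bool:
    # The superiority claim survives only if FineWeb beats the baseline on
    # every benchmark by more than the chosen margin.
    return all(results["FineWeb"][b] > results["The Pile"][b] + margin
               for b in BENCHMARKS)

print(curation_claim_holds(scores))  # False while the placeholder scores are tied
```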
Original abstract
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FineWeb, a 15-trillion-token dataset derived from 96 Common Crawl snapshots, and FineWeb-Edu, a 1.3-trillion-token educational subset filtered from it. It claims that LLMs pretrained on FineWeb outperform those trained on prior open corpora (e.g., C4, The Pile) on standard benchmarks, with FineWeb-Edu yielding particularly large gains on knowledge- and reasoning-intensive tasks such as MMLU and ARC. The work documents and ablates all curation choices (deduplication, filtering), releases the full datasets, the curation codebase, and all ablation-trained models.
Significance. If the reported gains are attributable to the curation pipeline rather than uncontrolled differences in training configuration, the contribution would be substantial: it supplies both a new large-scale open pretraining resource and concrete, reproducible evidence on the impact of specific filtering and deduplication decisions. The public release of the complete codebase, all ablation models, and the datasets themselves is a clear strength that enables direct verification and extension by the community.
Major comments (1)
- The central performance claims (Abstract; cross-corpus comparisons) rest on the assumption that training setups for FineWeb/FineWeb-Edu runs are identical to those used for the C4, The Pile, and other baseline corpora in terms of model architecture, parameter count, total tokens processed, batch size, optimizer, and learning-rate schedule. The paper provides internal ablations of its own filtering/deduplication choices and releases those models, but does not explicitly document or verify equivalence for the external baseline comparisons; any mismatch would prevent isolating the effect of the curation pipeline.
Minor comments (1)
- The abstract and results sections would benefit from a concise table summarizing the exact training hyperparameters (tokens, model size, optimizer settings) used for each compared dataset to make the equivalence claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The concern about ensuring and documenting equivalence in training setups for cross-corpus comparisons is well-taken, and we address it directly below. We will revise the manuscript to make this aspect fully explicit.
Point-by-point responses
-
Referee: The central performance claims (Abstract; cross-corpus comparisons) rest on the assumption that training setups for FineWeb/FineWeb-Edu runs are identical to those used for the C4, The Pile, and other baseline corpora in terms of model architecture, parameter count, total tokens processed, batch size, optimizer, and learning-rate schedule. The paper provides internal ablations of its own filtering/deduplication choices and releases those models, but does not explicitly document or verify equivalence for the external baseline comparisons; any mismatch would prevent isolating the effect of the curation pipeline.
Authors: We agree that explicit documentation is essential for isolating the effect of the curation pipeline. All models reported in the cross-corpus comparisons (including those trained on C4, The Pile, and the other baselines) were pretrained from scratch using an identical setup: the same 1.3B-parameter decoder-only transformer architecture, the same total token count, batch size, AdamW optimizer, and learning-rate schedule (with the same number of training steps). This controlled setup is described in Section 4 and Appendix C, and the full training code plus all resulting models (including the baseline runs) have been released to enable direct verification. We acknowledge that the manuscript does not state this equivalence as clearly or prominently as it should for the external baselines. In the revised version we will add an explicit paragraph in the Experiments section, together with a summary table of shared hyperparameters, to remove any ambiguity.
Revision: yes
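A minimal sketch of the controlled setup the rebuttal describes, in which one frozen training recipe is reused and only the dataset varies. The 1.3B decoder-only model and AdamW optimizer come from the rebuttal; every numeric value below is a placeholder, not a hyperparameter from the paper.

```python
# Sketch of one frozen recipe reused across datasets so that only the data
# varies. Model size and optimizer follow the rebuttal; the numbers are
# placeholders, not the paper's hyperparameters.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    model: str = "decoder-only-transformer-1.3B"
    total_tokens: int = 350_000_000_000   # placeholder token budget
    batch_size_tokens: int = 2_097_152    # placeholder
    optimizer: str = "AdamW"
    peak_lr: float = 3e-4                 # placeholder
    lr_schedule: str = "cosine"           # placeholder

SHARED = TrainConfig()  # identical for FineWeb, FineWeb-Edu, C4, The Pile, ...

def launch_run(dataset: str, cfg: TrainConfig = SHARED) -> str:
    # Only the dataset name changes between runs; everything else is pinned.
    return f"train {cfg.model} on {dataset} for {cfg.total_tokens:,} tokens with {cfg.optimizer}"

for ds in ["FineWeb", "FineWeb-Edu", "C4", "The Pile"]:
    print(launch_run(ds))
```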
Circularity Check
No significant circularity: the claims rest on empirical dataset curation and model-training results.
Full rationale
The paper's core claims rest on constructing FineWeb from Common Crawl snapshots via documented filtering/deduplication steps, then empirically training and evaluating LLMs on it (and on FineWeb-Edu) against baselines like C4 and The Pile. Performance deltas on MMLU/ARC are presented as measured outcomes from those runs, with ablations and released models allowing independent verification. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the results are externally falsifiable via the public data and code.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 31 Pith papers
-
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
-
Layer Collapse in Diffusion Language Models
Early layers in diffusion language models like LLaDA-8B collapse into redundant representations around a critical super-outlier activation due to overtraining, making them more robust to quantization and sparsity than...
-
Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
-
Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
Fixed 16-bit binary token codes can replace trainable input embeddings in 32-layer decoder-only models while maintaining comparable held-out perplexity on 17B tokens.
-
Dimension-Free Saddle-Point Escape in Muon
Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Finding Belief Geometries with Sparse Autoencoders
A new pipeline identifies candidate simplex geometries in Gemma-2-9B representations, with five clusters showing significant barycentric prediction advantages consistent with belief-state encoding.
-
Metriplector: From Field Theory to Neural Architecture
Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
-
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
-
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
GLiNER-Relex unifies NER and RE in one zero-shot transformer-based model that achieves competitive results on CoNLL04, DocRED, FewRel, and CrossRE.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
-
Language corpora for the Dutch medical domain
A 35-billion-token Dutch medical corpus was assembled from translated, mined, and extracted sources and released publicly on Hugging Face as the first large-scale resource of its kind.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.