The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 21:28 UTC · model grok-4.3
The pith
A new 825 GiB dataset built from 22 diverse text sources trains language models that generalize better across domains than those trained on raw web crawls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training large language models on this composite dataset of 22 diverse subsets produces better cross-domain knowledge and downstream generalization than training on less curated web data. The paper supports this by showing that GPT-style models trained on The Pile improve significantly over Raw CC and CC-100 baselines on all Pile components while also raising scores on downstream evaluations, and that untuned GPT-2 and GPT-3 struggle on the academic and professional text within the dataset.
What carries the argument
The Pile, a composite 825 GiB corpus constructed by combining 22 existing and newly assembled high-quality text subsets, many of them drawn from academic and professional sources.
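As an illustration of what combining 22 subsets means operationally, the sketch below mixes documents from named components in proportion to a per-subset weight. This is a minimal sketch: the subset names, weights, and documents are hypothetical placeholders, and the authors' released construction code handles filtering, deduplication, and sharding that this omits.

```python
import random

# Hypothetical subsets: name -> (list of documents, sampling weight).
# The real Pile has 22 components; names and numbers here are placeholders.
subsets = {
    "pubmed_abstracts": (["doc A1", "doc A2"], 2.0),
    "github":           (["doc B1", "doc B2"], 1.0),
    "web_crawl":        (["doc C1", "doc C2"], 1.0),
}

def sample_training_stream(subsets, n_docs, seed=0):
    """Draw documents in proportion to each subset's weight."""
    rng = random.Random(seed)
    names = list(subsets)
    weights = [subsets[n][1] for n in names]
    stream = []
    for _ in range(n_docs):
        name = rng.choices(names, weights=weights, k=1)[0]
        stream.append(rng.choice(subsets[name][0]))
    return stream

print(sample_training_stream(subsets, 5))
```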
Load-bearing premise
The reported performance gains are caused by the diversity and quality of the 22 subsets rather than by uncontrolled differences in training procedure, model scale, or data volume between the Pile-trained models and the Raw CC or CC-100 baselines.
What would settle it
A controlled retraining experiment that matches data volume, model size, and training steps exactly between a Pile-trained model and a Raw CC model, then evaluates both on held-out samples from every Pile component, would show whether the gains persist.
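A minimal sketch of such a matched comparison, assuming generic `train_lm` and `perplexity` interfaces that are not part of the paper; the shared settings and component names are placeholders, and only the training corpus differs between the two runs.

```python
# Hedged sketch of a matched-control comparison: identical model size, token
# budget, and training procedure; only the training corpus differs.
# train_lm and perplexity are assumed interfaces, not the authors' code.

SHARED = dict(
    n_params="1.3B",                # placeholder model scale
    train_tokens=300_000_000_000,   # placeholder token budget
    optimizer="adam",
    lr_schedule="warmup+cosine",
    seed=0,
)

def matched_comparison(pile_corpus, raw_cc_corpus, heldout_components,
                       train_lm, perplexity):
    model_pile = train_lm(corpus=pile_corpus, **SHARED)
    model_cc = train_lm(corpus=raw_cc_corpus, **SHARED)
    # Evaluate both models on held-out samples from every Pile component.
    return {
        name: {"pile": perplexity(model_pile, heldout),
               "raw_cc": perplexity(model_cc, heldout)}
        for name, heldout in heldout_components.items()
    }
```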
Original abstract
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 diverse high-quality subsets (both existing and newly constructed, many from academic/professional sources) for training large-scale language models. It reports that untuned GPT-2 and GPT-3 models struggle on several Pile components (e.g., academic writing), while models trained on the Pile outperform Raw CC and CC-100 baselines on all Pile components and on downstream evaluations. The authors include an exploratory analysis of potential data issues and release the construction code publicly.
Significance. If the reported gains hold under controlled conditions, the work supplies a large, publicly documented, and diverse training resource that can improve cross-domain generalization in language models. The open release of construction code is a concrete strength that supports reproducibility and community use.
major comments (2)
- [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
- [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
minor comments (2)
- [Dataset construction section] Dataset construction section: the 22 subsets would benefit from a single consolidated table listing exact sizes, sources, and preprocessing steps for each component to improve clarity and ease of replication.
- [Exploratory analysis] Exploratory analysis: some figures showing data characteristics (e.g., domain distributions or token statistics) could include more precise axis labels and legends for readability.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. The points raised regarding experimental controls and statistical rigor are important for strengthening the presentation of our results. We address each major comment below and describe the revisions we will incorporate in the updated version of the paper.
Point-by-point responses
-
Referee: [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
Authors: We appreciate the referee drawing attention to the need for explicit documentation of the training controls. The original manuscript described the overall training setup in Section 4 but did not sufficiently emphasize the matched conditions across datasets. In the revised manuscript we have expanded the training details subsection to state explicitly that the GPT-2-scale and GPT-3-scale models trained on The Pile and the corresponding Raw CC and CC-100 baselines were all trained from scratch using identical model architectures, the same total token budget (approximately 300 billion tokens), the same Adam optimizer with identical hyperparameters, the same learning-rate schedule including warmup and cosine decay, and equivalent total compute. A table summarizing the shared hyperparameters has been added for clarity. These controls ensure that observed differences can be attributed to dataset properties rather than training discrepancies. revision: yes
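For concreteness, the warmup-plus-cosine-decay schedule named in the response can be sketched as below; all constants are illustrative placeholders, not values reported by the authors.

```python
import math

def lr_at_step(step, max_lr=6e-4, warmup_steps=2000, total_steps=143000, min_lr=6e-5):
    """Linear warmup followed by cosine decay to min_lr.
    The specific constants are illustrative, not taken from the paper."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```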
-
Referee: [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.
Authors: We agree that the strength of the 'significant' claim would benefit from additional statistical support. In the revised manuscript we have added error bars to the downstream-task figures, derived from multiple runs with different random seeds for the smaller model scales where compute permitted. We have also included the results of paired statistical tests (t-tests) on the key downstream benchmarks comparing Pile-trained models to the CC baselines. For the per-component Pile evaluations we now report standard deviations across model sizes. Due to the high computational cost of full-scale retraining we were limited in the number of replicate runs; however, the consistent direction and magnitude of gains across scales provide supporting evidence. The abstract and results section have been updated to reflect these additions. revision: partial
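A paired test of the kind described could look like the following; the per-seed accuracies are illustrative placeholders used only to show the computation, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-seed downstream accuracies at one model scale;
# these numbers are placeholders, not reported results.
pile_runs = np.array([0.612, 0.605, 0.618, 0.609])    # seeds 0..3, Pile-trained
raw_cc_runs = np.array([0.583, 0.590, 0.579, 0.586])  # same seeds, Raw CC-trained

t_stat, p_value = ttest_rel(pile_runs, raw_cc_runs)
diff = pile_runs - raw_cc_runs
print(f"mean gain {diff.mean():.3f} ± {diff.std(ddof=1):.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```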
Circularity Check
No significant circularity; empirical dataset construction and comparisons are self-contained
full rationale
The paper constructs the Pile dataset from 22 subsets and reports empirical results showing improved performance of GPT-2/GPT-3 models trained on it versus Raw CC and CC-100 baselines on Pile components and downstream tasks. There is no derivation chain, set of equations, or first-principles result that could reduce to its own inputs by construction. None of the usual circularity patterns appear: self-definitional claims, fitted inputs presented as predictions, load-bearing self-citations, uniqueness theorems, ansatz smuggling, or renaming of known results. The central claims rest on new data assembly and direct comparisons against external benchmarks, with no self-referential reductions or load-bearing self-citations that would collapse the argument.
Forward citations
Cited by 60 Pith papers
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
-
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
-
Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
-
Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.
-
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
-
What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews
Direct relevance to a key research question is the strongest predictor of a response's contribution to qualitative study findings, while clarity and surprisal-based informativeness are not predictive.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
Extending Context Window of Large Language Models via Positional Interpolation
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Quantifying Memorization Across Neural Language Models
Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
-
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
-
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
-
NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
NCO enables efficient online pattern matching for negative hard and regex constraints in LLM decoding to prevent forbidden content without state explosion.
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
-
TextLDM: Language Modeling with Continuous Latent Diffusion
TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
-
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
-
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Feature Starvation as Geometric Instability in Sparse Autoencoders
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
-
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
-
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
-
Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
-
NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...
-
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
-
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
-
Architecture Determines Observability of Transformers
Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
-
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.
-
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
-
Improving Robustness In Sparse Autoencoders via Masked Regularization
Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
Efficient Streaming Language Models with Attention Sinks
StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
-
Retentive Network: A Successor to Transformer for Large Language Models
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Measuring Coding Challenge Competence With APPS
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
-
LLM Jaggedness Unlocks Scientific Creativity
LLMs exhibit jagged scientific creativity across models, prompts, and domains, and this unevenness can be leveraged via model ensembles to outperform any single model on idea generation.
-
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.