super hub Canonical reference

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Francis Song, Jack W. Rae, Jordan Hoffmann, Katie Millican, Sebastian Borgeaud, Trevor Cai · 2021 · cs.CL · arXiv 2112.11446

Canonical reference. 83% of citing Pith papers cite this work as background.

100 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 100 citing papers more from Francis Song arXiv PDF

abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 22 method 2

citation-polarity summary

background 20 unclear 2 use method 2

claims ledger

abstract Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, an

authors

Francis Song Jack W. Rae Jordan Hoffmann Katie Millican Sebastian Borgeaud Trevor Cai

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

MetaKE: Meta-Learning for Knowledge Editing Toward a Better Accuracy-Editability Trade-off

cs.CL · 2026-03-13 · unverdicted · novelty 7.0

MetaKE unifies knowledge editing stages via bi-level optimization and a structural gradient proxy to improve the accuracy-editability trade-off over prior methods.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

cs.CL · 2024-12-30 · unverdicted · novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

cs.LG · 2024-02-29 · unverdicted · novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

cs.LG · 2024-01-24 · accept · novelty 7.0

VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

Towards Measuring the Representation of Subjective Global Opinions in Language Models

cs.CL · 2023-06-28 · conditional · novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

Accelerating Large Language Model Decoding with Speculative Sampling

cs.CL · 2023-02-02 · accept · novelty 7.0

Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

Large Language Models are Zero-Shot Reasoners

cs.CL · 2022-05-24 · accept · novelty 7.0

Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

cs.CV · 2022-05-23 · accept · novelty 7.0

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

cs.RO · 2022-04-04 · accept · novelty 7.0

SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.

citing papers explorer

Showing 50 of 100 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 50 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders cs.IR · 2024-03-06 · unverdicted · none · ref 33 · internal anchor
BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 231 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 69 · internal anchor
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 32 · internal anchor
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models eess.AS · 2026-04-28 · unverdicted · none · ref 60 · internal anchor
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
MetaKE: Meta-Learning for Knowledge Editing Toward a Better Accuracy-Editability Trade-off cs.CL · 2026-03-13 · unverdicted · none · ref 3 · internal anchor
MetaKE unifies knowledge editing stages via bi-level optimization and a structural gradient proxy to improve the accuracy-editability trade-off over prior methods.
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention cs.LG · 2025-10-05 · unverdicted · none · ref 25 · internal anchor
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs cs.CL · 2024-12-30 · unverdicted · none · ref 114 · internal anchor
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection cs.LG · 2024-03-06 · conditional · none · ref 31 · internal anchor
GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models cs.LG · 2024-02-29 · unverdicted · none · ref 25 · internal anchor
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks cs.LG · 2024-01-24 · accept · none · ref 6 · internal anchor
VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.
C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 48 · internal anchor
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
Towards Measuring the Representation of Subjective Global Opinions in Language Models cs.CL · 2023-06-28 · conditional · none · ref 70 · internal anchor
LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
Accelerating Large Language Model Decoding with Speculative Sampling cs.CL · 2023-02-02 · accept · none · ref 15 · internal anchor
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks cs.CL · 2022-11-22 · unverdicted · none · ref 24 · internal anchor
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
Large Language Models are Zero-Shot Reasoners cs.CL · 2022-05-24 · accept · none · ref 3 · internal anchor
Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding cs.CV · 2022-05-23 · accept · none · ref 50 · internal anchor
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 47 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 279 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 87 · internal anchor
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances cs.RO · 2022-04-04 · accept · none · ref 6 · internal anchor
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
Unified Data Selection for LLM Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 51 · internal anchor
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification cs.LG · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.
When is Warmstarting Effective for Scaling Language Models? cs.LG · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 38 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 4 · 2 links · internal anchor
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA cs.AI · 2026-05-05 · unverdicted · none · ref 67 · internal anchor
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition cs.CL · 2026-05-04 · unverdicted · none · ref 32 · internal anchor
InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger scales up to 7B models and 425B tokens.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 247 · internal anchor
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training cs.CR · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
CoLA reveals that subset training creates new privacy leakage surfaces via side-channel metadata and model outputs, enabling training-membership and selection-participation membership inference attacks.
AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM cs.CL · 2025-10-20 · unverdicted · none · ref 26 · internal anchor
AtlasKV integrates billion-scale KGs into LLMs parametrically with sub-linear complexity and low memory by converting triples into key-value representations handled by the model's attention.
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure cs.LG · 2025-09-23 · unverdicted · none · ref 51 · internal anchor
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention cs.CL · 2025-06-16 · unverdicted · none · ref 29 · internal anchor
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 13 · internal anchor
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 92 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
MiniMax-01: Scaling Foundation Models with Lightning Attention cs.CL · 2025-01-14 · unverdicted · none · ref 41 · internal anchor
MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 49 · internal anchor
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 213 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 146 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 34 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 92 · internal anchor
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval cs.CL · 2024-01-31 · unverdicted · none · ref 46 · internal anchor
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 79 · internal anchor
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 17 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 175 · internal anchor
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Language Modeling Is Compression cs.LG · 2023-09-19 · accept · none · ref 13 · internal anchor
Large language models serve as strong general-purpose lossless compressors for text, images, and audio, outperforming domain-specific methods and revealing insights into scaling, tokenization, and in-context learning.
Reinforced Self-Training (ReST) for Language Modeling cs.CL · 2023-08-17 · unverdicted · none · ref 19 · internal anchor
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
AudioPaLM: A Large Language Model That Can Speak and Listen cs.CL · 2023-06-22 · unverdicted · none · ref 37 · internal anchor
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023-06-01 · unverdicted · none · ref 35 · internal anchor
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer