pith. sign in

super hub Canonical reference

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Canonical reference. 83% of citing Pith papers cite this work as background.

101 Pith papers citing it
Background 83% of classified citations
abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

hub tools

citation-role summary

background 22 method 2

citation-polarity summary

claims ledger

  • abstract Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, an

authors

co-cited works

clear filters

representative citing papers

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

Large Language Models are Zero-Shot Reasoners

cs.CL · 2022-05-24 · accept · novelty 7.0

Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

citing papers explorer

Showing 21 of 21 citing papers after filters.

  • Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models eess.AS · 2026-04-28 · unverdicted · none · ref 60 · internal anchor

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

  • C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 48 · internal anchor

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  • Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding cs.CV · 2022-05-23 · accept · none · ref 50 · internal anchor

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  • Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 87 · internal anchor

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances cs.RO · 2022-04-04 · accept · none · ref 6 · internal anchor

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  • UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 38 · internal anchor

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  • Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 4 · 2 links · internal anchor

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

  • Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA cs.AI · 2026-05-05 · unverdicted · none · ref 67 · internal anchor

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.

  • Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 13 · internal anchor

    Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.

  • MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 92 · internal anchor

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 97 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 10 · internal anchor

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 120 · internal anchor

    BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them cs.CL · 2022-10-17 · accept · none · ref 21 · internal anchor

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned cs.CL · 2022-08-23 · accept · none · ref 44 · internal anchor

    RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

  • LaMDA: Language Models for Dialog Applications cs.CL · 2022-01-20 · unverdicted · none · ref 24 · internal anchor

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  • Position: LLM Inference Should Be Evaluated as Energy-to-Token Production cs.CE · 2026-05-12 · unverdicted · none · ref 36 · internal anchor

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  • JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL · 2026-04-03 · unverdicted · none · ref 42 · internal anchor

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 79 · internal anchor

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 66 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  • A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 116 · internal anchor

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.