pith. sign in

super hub Canonical reference

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Canonical reference. 83% of citing Pith papers cite this work as background.

105 Pith papers citing it
Background 83% of classified citations
abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

hub tools

citation-role summary

background 22 method 2

citation-polarity summary

claims ledger

  • abstract Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, an

authors

co-cited works

clear filters

representative citing papers

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

Large Language Models are Zero-Shot Reasoners

cs.CL · 2022-05-24 · accept · novelty 7.0

Adding the fixed prompt 'Let's think step by step' enables large language models to achieve substantial zero-shot gains on arithmetic, symbolic, and logical reasoning benchmarks without any task-specific examples.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

citing papers explorer

Showing 24 of 24 citing papers after filters.

  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 231 · internal anchor

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  • C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 48 · internal anchor

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  • Towards Measuring the Representation of Subjective Global Opinions in Language Models cs.CL · 2023-06-28 · conditional · none · ref 70 · internal anchor

    LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

  • Accelerating Large Language Model Decoding with Speculative Sampling cs.CL · 2023-02-02 · accept · none · ref 15 · internal anchor

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  • Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 79 · internal anchor

    Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

  • The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 17 · internal anchor

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 175 · internal anchor

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  • Language Modeling Is Compression cs.LG · 2023-09-19 · accept · none · ref 13 · internal anchor

    Large language models serve as strong general-purpose lossless compressors for text, images, and audio, outperforming domain-specific methods and revealing insights into scaling, tokenization, and in-context learning.

  • Reinforced Self-Training (ReST) for Language Modeling cs.CL · 2023-08-17 · unverdicted · none · ref 19 · internal anchor

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  • AudioPaLM: A Large Language Model That Can Speak and Listen cs.CL · 2023-06-22 · unverdicted · none · ref 37 · internal anchor

    AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023-06-01 · unverdicted · none · ref 35 · internal anchor

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  • Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 97 · internal anchor

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  • Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 10 · internal anchor

    Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.

  • BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 90 · internal anchor

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  • SemDeDup: Data-efficient learning at web-scale through semantic deduplication cs.LG · 2023-03-16 · unverdicted · none · ref 21 · internal anchor

    SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

  • Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 29 · internal anchor

    Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

  • The Flan Collection: Designing Data and Methods for Effective Instruction Tuning cs.AI · 2023-01-31 · conditional · none · ref 45 · internal anchor

    The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

  • REPLUG: Retrieval-Augmented Black-Box Language Models cs.CL · 2023-01-30 · conditional · none · ref 34 · internal anchor

    REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.

  • MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning cs.CV · 2023-10-14 · unverdicted · none · ref 39 · internal anchor

    MiniGPT-v2 adds unique task identifiers to a large language model so one system can perform image description, visual question answering, and visual grounding after three-stage training.

  • PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 117 · internal anchor

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  • StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 88 · internal anchor

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  • PaLI-X: On Scaling up a Multilingual Vision and Language Model cs.CV · 2023-05-29 · unverdicted · none · ref 90 · internal anchor

    Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 66 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

  • A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 116 · internal anchor

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.