Exploring the Limits of Language Modeling

Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui · 2016 · cs.CL · arXiv 1602.02410

Canonical reference. 23 Pith papers cite this work; 83% of classified citations (5 of 6) cite it as background.

abstract

In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and vocabulary sizes, and complex, long term structure of language. We perform an exhaustive study on techniques such as character Convolutional Neural Networks or Long-Short Term Memory, on the One Billion Word Benchmark. Our best single model significantly improves state-of-the-art perplexity from 51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20), while an ensemble of models sets a new record by improving perplexity from 41.0 down to 23.7. We also release these models for the NLP and ML community to study and improve upon.

citation-role summary

background: 5 · other: 1

citation-polarity summary

background: 5 · unclear: 1

representative citing papers

WaveNet: A Generative Model for Raw Audio

cs.SD · 2016-09-12 · accept · novelty 9.0

WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Density estimation using Real NVP

cs.LG · 2016-05-27 · accept · novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.

Improving language models by retrieving from trillions of tokens

cs.CL · 2021-12-08 · unverdicted · novelty 7.0

RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.

Unsupervised Cross-lingual Representation Learning at Scale

cs.CL · 2019-11-05 · conditional · novelty 7.0

XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

Generating Long Sequences with Sparse Transformers

cs.LG · 2019-04-23 · unverdicted · novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

Deep Learning Scaling is Predictable, Empirically

cs.LG · 2017-12-01 · unverdicted · novelty 7.0

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

Mixed Precision Training

cs.AI · 2017-10-10 · accept · novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle

q-bio.NC · 2026-05-03 · unverdicted · novelty 6.0

Information defined as maximum-caliber deviation derives IIT 3.0 cause-effect repertoires from constrained entropy maximization and equates to prediction error under CLT and LDT.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Scaling Laws and Interpretability of Learning from Repeated Data

cs.LG · 2022-05-21 · accept · novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

Compressive Transformers for Long-Range Sequence Modelling

cs.LG · 2019-11-13 · unverdicted · novelty 6.0

Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

LaMDA: Language Models for Dialog Applications

cs.CL · 2022-01-20 · unverdicted · novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

cs.CL · 2022-01-28 · unverdicted · novelty 5.0

Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Attention Is All You Need

cs.CL · 2017-06-12 · unverdicted · novelty 5.0

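The Transformer dispenses with recurrence and convolutions in favor of attention alone, achieving state-of-the-art machine translation quality at lower training cost.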

Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER

cs.CL · 2019-07-25 · unverdicted · novelty 4.0

Cross-lingual fine-tuning of pre-trained LMs yields significant gains on small gold Indonesian NER and competitive results on large silver data versus monolingual LM or POS transfer.

Why Build an Assistant in Minecraft?

cs.AI · 2019-07-22 · unverdicted · novelty 4.0

A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.

Scalable Multi Corpora Neural Language Models for ASR

cs.CL · 2019-07-02 · unverdicted · novelty 4.0

The authors report scalable training of neural LMs from heterogeneous corpora for ASR second-pass rescoring, delivering 6.2% relative WER reduction with minimal latency increase.

citing papers explorer

Showing 23 of 23 citing papers.

WaveNet: A Generative Model for Raw Audio · cs.SD · 2016-09-12 · accept · none · ref 20
Language Models are Few-Shot Learners · cs.CL · 2020-05-28 · accept · none · ref 28
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer · cs.LG · 2017-01-23 · accept · none · ref 27
Density estimation using Real NVP · cs.LG · 2016-05-27 · accept · none · ref 32
Improving language models by retrieving from trillions of tokens · cs.CL · 2021-12-08 · unverdicted · none · ref 28 · internal anchor
Unsupervised Cross-lingual Representation Learning at Scale · cs.CL · 2019-11-05 · conditional · none · ref 3 · internal anchor
Augmenting Self-attention with Persistent Memory · cs.LG · 2019-07-02 · unverdicted · none · ref 21 · internal anchor
Flamingo: a Visual Language Model for Few-Shot Learning · cs.CV · 2022-04-29 · unverdicted · none · ref 53
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer · cs.LG · 2019-10-23 · unverdicted · none · ref 33
Generating Long Sequences with Sparse Transformers · cs.LG · 2019-04-23 · unverdicted · none · ref 11
Deep Learning Scaling is Predictable, Empirically · cs.LG · 2017-12-01 · unverdicted · none · ref 5
Mixed Precision Training · cs.AI · 2017-10-10 · accept · none · ref 17
Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle · q-bio.NC · 2026-05-03 · unverdicted · none · ref 151 · internal anchor
The Falcon Series of Open Language Models · cs.CL · 2023-11-28 · conditional · none · ref 298 · internal anchor
Scaling Laws and Interpretability of Learning from Repeated Data · cs.LG · 2022-05-21 · accept · none · ref 30 · internal anchor
Compressive Transformers for Long-Range Sequence Modelling · cs.LG · 2019-11-13 · unverdicted · none · ref 80 · internal anchor
LaMDA: Language Models for Dialog Applications · cs.CL · 2022-01-20 · unverdicted · none · ref 21
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model · cs.CL · 2022-01-28 · unverdicted · none · ref 26 · internal anchor
StarCoder: may the source be with you! · cs.CL · 2023-05-09 · accept · none · ref 221
Attention Is All You Need · cs.CL · 2017-06-12 · unverdicted · none · ref 15
Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER · cs.CL · 2019-07-25 · unverdicted · none · ref 11 · internal anchor
Why Build an Assistant in Minecraft? · cs.AI · 2019-07-22 · unverdicted · none · ref 40 · internal anchor
Scalable Multi Corpora Neural Language Models for ASR · cs.CL · 2019-07-02 · unverdicted · none · ref 10 · internal anchor
