Generating Wikipedia by Summarizing Long Sequences

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer · 2018 · cs.CL · arXiv 1801.10198

15 Pith papers cite this work. Polarity classification is still indexing.

abstract

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
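The two-stage idea in the abstract (a coarse extractive pass, then a single long sequence handed to a decoder-only model) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the whitespace tokenizer, the smoothed tf-idf scoring, the `<SEP>` marker, and the function names `rank_paragraphs` and `build_sequence` are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def rank_paragraphs(query, paragraphs):
    """Sort paragraphs by a simple smoothed tf-idf score against the query."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)

    def idf(word):
        # Smoothed inverse document frequency over the paragraph set.
        df = sum(1 for d in docs if word in d)
        return math.log((n + 1) / (df + 1)) + 1.0

    query_words = set(tokenize(query))

    def score(doc):
        # Term frequency in the paragraph times idf, summed over query words.
        return sum(doc[w] * idf(w) for w in query_words)

    order = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [paragraphs[i] for i in order]


def build_sequence(query, paragraphs, max_tokens, sep="<SEP>"):
    """Concatenate the query and ranked paragraphs, truncate, append a separator.

    A decoder-only model would be trained to continue this sequence with the
    target article text; the separator token here is a hypothetical stand-in.
    """
    tokens = tokenize(query)
    for p in rank_paragraphs(query, paragraphs):
        tokens.extend(tokenize(p))
    return tokens[:max_tokens] + [sep]


paragraphs = [
    "the cat sat on the mat",
    "transformer models attend to long sequences",
    "wikipedia articles about transformer models",
]
ranked = rank_paragraphs("wikipedia transformer", paragraphs)
sequence = build_sequence("wikipedia transformer", paragraphs, max_tokens=5)
```

In this toy setup the extractive stage only decides which source text survives truncation; all generation is left to the model that continues the sequence after the separator.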

representative citing papers

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

A decoder-only foundation model for time-series forecasting

cs.CL · 2023-10-14 · unverdicted · novelty 7.0

A pretrained decoder-only patched transformer achieves near state-of-the-art zero-shot forecasting performance across diverse time series datasets and settings.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Steering Language Models With Activation Engineering

cs.CL · 2023-08-20 · unverdicted · novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

Scaling Laws for Autoregressive Generative Modeling

cs.LG · 2020-10-28 · accept · novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

Generating Long Sequences with Sparse Transformers

cs.LG · 2019-04-23 · unverdicted · novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands of timesteps long, setting a new state of the art in density modeling on Enwik8, CIFAR-10, and ImageNet-64.

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

BioGen: Automated Biography Generation

cs.DL · 2019-06-27 · unverdicted · novelty 3.0

BioGen generates biographical sentences clustered by life events and claims its outputs are significantly closer to Wikipedia biographies than prior methods.

Machine Reading Comprehension: a Literature Review

cs.CL · 2019-06-30 · unverdicted · novelty 1.0

A 2019 survey of machine reading comprehension corpora and methods.

citing papers explorer

Showing 15 of 15 citing papers.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling · cs.CL · 2020-12-31 · conditional · none · ref 212
Language Models are Few-Shot Learners · cs.CL · 2020-05-28 · accept · none · ref 46
Reformer: The Efficient Transformer · cs.LG · 2020-01-13 · accept · none · ref 12
A decoder-only foundation model for time-series forecasting · cs.CL · 2023-10-14 · unverdicted · none · ref 13 · internal anchor
Refusal in Language Models Is Mediated by a Single Direction · cs.LG · 2024-06-17 · accept · none · ref 152
Steering Language Models With Activation Engineering · cs.CL · 2023-08-20 · unverdicted · none · ref 78
Scaling Laws for Autoregressive Generative Modeling · cs.LG · 2020-10-28 · accept · none · ref 13
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer · cs.LG · 2019-10-23 · unverdicted · none · ref 44
Generating Long Sequences with Sparse Transformers · cs.LG · 2019-04-23 · unverdicted · none · ref 13
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning · cs.LG · 2026-05-16 · unverdicted · none · ref 56 · internal anchor
Scaling Laws for Transfer · cs.LG · 2021-02-02 · unverdicted · none · ref 38 · internal anchor
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 124
A General Language Assistant as a Laboratory for Alignment · cs.CL · 2021-12-01 · conditional · none · ref 237
BioGen: Automated Biography Generation · cs.DL · 2019-06-27 · unverdicted · none · ref 7 · internal anchor
Machine Reading Comprehension: a Literature Review · cs.CL · 2019-06-30 · unverdicted · none · ref 28 · internal anchor
