pith. sign in

super hub Mixed citations

Pointer Sentinel Mixture Models

Mixed citation behavior. Most common role is background (56%).

159 Pith papers citing it
Background 56% of classified citations
abstract

Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.

hub tools

citation-role summary

background 9 dataset 5 method 1 other 1

citation-polarity summary

claims ledger

  • abstract Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree

authors

co-cited works

clear filters

representative citing papers

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • SpinQuant: LLM quantization with learned rotations cs.LG · 2024-05-26 · conditional · none · ref 13 · internal anchor

    SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.

  • Chronos: Learning the Language of Time Series cs.LG · 2024-03-12 · conditional · none · ref 57 · internal anchor

    Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.

  • Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 134 · internal anchor

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  • How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 40 · internal anchor

    The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.