pith. machine review for the scientific record.

arxiv: 1910.03771 · v5 · submitted 2019-10-09 · 💻 cs.CL

Recognition: no theorem link

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Alexander M. Rush, Anthony Moi, Canwen Xu, Clara Ma, Clement Delangue, Joe Davison, Julien Chaumond, Julien Plu, Lysandre Debut, Mariama Drame, Morgan Funtowicz, Patrick von Platen, Pierric Cistac, Quentin Lhoest, Rémi Louf, Sam Shleifer, Sylvain Gugger, Teven Le Scao, Thomas Wolf, Tim Rault, Victor Sanh, Yacine Jernite

Pith reviewed 2026-05-11 14:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords transformers · natural language processing · open-source library · pretrained models · unified API · machine learning · transformer architectures · NLP tools

The pith

An open-source library supplies a unified API and pretrained models for state-of-the-art Transformer architectures in natural language processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Transformers library to make recent advances in Transformer models and pretraining available beyond a small group of specialists. It assembles carefully engineered implementations of these architectures behind one consistent interface and pairs them with community-contributed pretrained weights. The design targets three audiences at once: researchers who need to modify or extend the code, practitioners who want simple access to high-performing models, and industrial teams that require reliable, fast deployment. By lowering the cost of reproducing and applying these models, the library aims to accelerate experimentation and adoption across natural language tasks.

Core claim

Transformers is an open-source library that provides state-of-the-art Transformer architectures under a single unified API, together with a curated collection of pretrained models contributed by and available to the community. The library is engineered to be extensible for researchers, straightforward for practitioners, and sufficiently robust and efficient for industrial use.

What carries the argument

The unified API that wraps multiple Transformer architectures while preserving their original performance and allowing consistent access to pretrained weights.
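
As an illustrative sketch (not an excerpt from the paper), the call pattern this unified API supports in the present-day library looks roughly like the following; the checkpoint name, task head, and example sentence are assumptions chosen for illustration:

```python
# Minimal sketch of the unified interface (illustrative; assumes the
# `transformers` and `torch` packages are installed).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The same two calls cover other architectures (GPT-2, RoBERTa, XLNet, ...)
# by swapping the checkpoint name; "bert-base-uncased" is just one example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("One API fronts many Transformer variants.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels); the classification head here is untrained
```

The point of the sketch is the consistency of `from_pretrained` across model families, which is the property the argument leans on.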

If this is right

  • New models can be added by researchers without rewriting core training or inference loops.
  • Practitioners gain immediate access to high-performing models for downstream tasks without reimplementing architectures.
  • Industrial deployments benefit from a single, maintained codebase that supports multiple frameworks and hardware targets.
  • Community contributions expand the set of available pretrained models and task-specific fine-tunes.
  • Standardized interfaces reduce the engineering overhead of comparing or combining different Transformer variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread use of the shared codebase could shift research focus from reimplementation details to new modeling ideas or data strategies.
  • If the library remains actively maintained, it may serve as a de-facto reference implementation that influences how future papers release code.
  • The same API pattern could be extended to other modalities, such as vision or speech, once corresponding Transformer models mature.

Load-bearing premise

The library's implementations must match the accuracy and behavior reported in the original papers that introduced each Transformer model.

What would settle it

A side-by-side benchmark on a standard task such as GLUE or SQuAD in which a model loaded from the library underperforms the numbers published in its source paper would falsify the claim of faithful reproduction.
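
A hedged sketch of such a check, assuming the current `transformers` and `datasets` packages and a publicly hosted SST-2 checkpoint; the checkpoint name and the reference figure below are assumptions for illustration, not numbers taken from the paper:

```python
# Illustrative reproduction check on GLUE SST-2 (not from the paper).
# Assumes `transformers`, `datasets`, and a backend such as `torch` are installed.
from datasets import load_dataset
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
val = load_dataset("glue", "sst2", split="validation")

# This checkpoint labels outputs "NEGATIVE"/"POSITIVE"; GLUE encodes them as 0/1.
label_to_id = {"NEGATIVE": 0, "POSITIVE": 1}
preds = [label_to_id[p["label"]] for p in clf(val["sentence"], truncation=True)]
accuracy = sum(int(p == y) for p, y in zip(preds, val["label"])) / len(preds)

PUBLISHED_ACCURACY = 0.91  # placeholder for the number reported in the source paper
print(f"library checkpoint: {accuracy:.3f} vs. published: {PUBLISHED_ACCURACY:.3f}")
```

If a checkpoint loaded this way falls clearly short of the score reported in its source paper, the faithful-reproduction premise fails; if it matches, the premise survives that test.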

read the original abstract

Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript describes the Hugging Face Transformers library, an open-source Python package that implements a range of state-of-the-art Transformer architectures for natural language processing under a single, consistent API. It is backed by a curated collection of community-contributed pretrained models and is positioned as extensible for researchers, simple for practitioners, and robust for industrial deployment. The library is hosted at https://github.com/huggingface/transformers.

Significance. If the described implementations and pretrained weights are faithful to the original papers, the work is significant because it lowers the barrier to using high-capacity Transformer models, promotes reproducibility through open weights and code, and accelerates both research and deployment in NLP. The emphasis on a unified API and community contributions is a concrete strength that directly supports the paper's stated goals.

minor comments (1)
  1. [Abstract] The phrase 'state-of-the art' is missing a hyphen and should read 'state-of-the-art'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, recognition of the library's role in lowering barriers to Transformer models, and recommendation to accept. We appreciate the emphasis on the unified API and community contributions as key strengths.

Circularity Check

0 steps flagged

No significant circularity; factual software documentation

full rationale

The paper is an announcement and documentation of the Hugging Face Transformers open-source library. It describes goals, design principles, and availability of a software package with pretrained models under a unified API. No mathematical derivations, equations, fitted parameters, predictions of new quantities, or self-referential claims appear. The central claim is the existence and features of publicly available code, externally verifiable via the GitHub URL and community contributions. No load-bearing steps reduce to inputs by construction, and the document contains no self-citation chains or uniqueness theorems invoked to justify internal results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library announcement paper containing no mathematical derivations, fitted constants, or postulated physical entities.

pith-pipeline@v0.9.0 · 5514 in / 991 out tokens · 27196 ms · 2026-05-11T14:49:06.688795+00:00 · methodology


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  3. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  4. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  5. TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

    cs.CL 2026-05 unverdicted novelty 7.0

    TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

  6. EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

    cs.CL 2026-05 unverdicted novelty 7.0

    EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

  7. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  8. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  9. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  10. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  11. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  12. VertAX: a differentiable vertex model for learning epithelial tissue mechanics

    cs.LG 2026-04 unverdicted novelty 7.0

    VertAX supplies a differentiable JAX implementation of vertex models for confluent epithelia that enables forward simulation, mechanical parameter inference, and inverse design of tissue-scale behaviors.

  13. Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual attention in MLLMs shows inertia that hinders cognitive inference on object relations, addressed by a training-free Inertia-aware Visual Excitation method that selects dynamically emerging tokens and applies an...

  14. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  15. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  16. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  17. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  18. Large Spectrum Models (LSMs): Decoder-Only Transformer-Powered Spectrum Activity Forecasting via Tokenized RF Data

    cs.NI 2026-05 unverdicted novelty 6.0

    Decoder-only transformers trained on tokenized RF spectrum data from 22 TB of measurements achieve 3.25 dB RMSE in spectrum activity forecasting across 33 bands.

  19. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  20. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  21. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  22. BAMI: Training-Free Bias Mitigation in GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

  23. Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    Scaling pretrained representations improves label-free OOD detection on frozen backbones, causing performance gaps between global and local detectors to vanish across vision and language tasks.

  24. On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

    cs.CR 2026-05 conditional novelty 6.0

    An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

  25. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  26. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  27. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  28. RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical predic...

  29. Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...

  30. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  31. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  32. Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge

    cs.DC 2026-04 unverdicted novelty 6.0

    ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, imp...

  33. Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

    cs.LG 2026-04 unverdicted novelty 6.0

    LLM warm-starts for bandits remain better than cold-starts up to roughly 30% random label noise but increase regret under systematic misalignment, with a derived sufficient condition on prior error that predicts when ...

  34. MemFactory: Unified Inference & Training Framework for Agent Memory

    cs.CL 2026-03 unverdicted novelty 6.0

    MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

  35. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  36. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  37. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  38. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    cs.CL 2023-03 unverdicted novelty 6.0

    AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.

  39. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    cs.CV 2023-03 accept novelty 6.0

    Grounding DINO fuses language and vision via feature enhancer, language-guided query selection, and cross-modality decoder in a DINO backbone, achieving 52.5 AP zero-shot on COCO and a new record of 26.1 AP mean on ODinW.

  40. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  41. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  42. EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

    cs.CL 2026-05 unverdicted novelty 5.0

    EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.

  43. GiVA: Gradient-Informed Bases for Vector-Based Adaptation

    cs.CL 2026-04 unverdicted novelty 5.0

    GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.

  44. Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts

    cs.SE 2026-04 conditional novelty 5.0

    STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.

  45. Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

    cs.CV 2026-04 unverdicted novelty 5.0

    A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.

  46. FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

    cs.CR 2026-04 unverdicted novelty 5.0

    FedSpy-LLM uses gradient decomposition and iterative alignment to reconstruct larger batches and longer sequences of training data from LLM gradients in federated settings, including with PEFT methods.

  47. OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

    cs.CR 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 47 Pith papers · 15 internal anchors
