pith. machine review for the scientific record.

arxiv: 1910.03771 · v5 · submitted 2019-10-09 · 💻 cs.CL

Recognition: no theorem link

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Alexander M. Rush, Anthony Moi, Canwen Xu, Clara Ma, Clement Delangue, Joe Davison, Julien Chaumond, Julien Plu, Lysandre Debut, Mariama Drame, Morgan Funtowicz, Patrick von Platen, Pierric Cistac, Quentin Lhoest, Rémi Louf, Sam Shleifer, Sylvain Gugger, Teven Le Scao, Thomas Wolf, Tim Rault, Victor Sanh, Yacine Jernite

Pith reviewed 2026-05-11 14:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords transformers · natural language processing · open-source library · pretrained models · unified API · machine learning · transformer architectures · NLP tools

The pith

An open-source library supplies a unified API and pretrained models for state-of-the-art Transformer architectures in natural language processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Transformers library to make recent advances in Transformer models and pretraining available beyond a small group of specialists. It assembles carefully engineered implementations of these architectures behind one consistent interface and pairs them with community-contributed pretrained weights. The design targets three audiences at once: researchers who need to modify or extend the code, practitioners who want simple access to high-performing models, and industrial teams that require reliable, fast deployment. By lowering the cost of reproducing and applying these models, the library aims to accelerate experimentation and adoption across natural language tasks.

Core claim

Transformers is an open-source library that provides state-of-the-art Transformer architectures under a single unified API, together with a curated collection of pretrained models contributed by and available to the community. The library is engineered to be extensible for researchers, straightforward for practitioners, and sufficiently robust and efficient for industrial use.

What carries the argument

The unified API that wraps multiple Transformer architectures while preserving their original performance and allowing consistent access to pretrained weights.
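
As an illustrative sketch (not an excerpt from the paper), the call pattern this unified API supports in the present-day library looks roughly like the following; the checkpoint name, task head, and example sentence are assumptions chosen for illustration:

```python
# Minimal sketch of the unified interface (illustrative; assumes the
# `transformers` and `torch` packages are installed).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The same two calls cover other architectures (GPT-2, RoBERTa, XLNet, ...)
# by swapping the checkpoint name; "bert-base-uncased" is just one example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("One API fronts many Transformer variants.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels); the classification head here is untrained
```

The point of the sketch is the consistency of `from_pretrained` across model families, which is the property the argument leans on.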

If this is right

  • New models can be added by researchers without rewriting core training or inference loops.
  • Practitioners gain immediate access to high-performing models for downstream tasks without reimplementing architectures.
  • Industrial deployments benefit from a single, maintained codebase that supports multiple frameworks and hardware targets.
  • Community contributions expand the set of available pretrained models and task-specific fine-tunes.
  • Standardized interfaces reduce the engineering overhead of comparing or combining different Transformer variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread use of the shared codebase could shift research focus from reimplementation details to new modeling ideas or data strategies.
  • If the library remains actively maintained, it may serve as a de-facto reference implementation that influences how future papers release code.
  • The same API pattern could be extended to other modalities, such as vision or speech, once corresponding Transformer models mature.

Load-bearing premise

The library's implementations must match the accuracy and behavior reported in the original papers that introduced each Transformer model.

What would settle it

A side-by-side benchmark on a standard task such as GLUE or SQuAD in which a model loaded from the library underperforms the numbers published in its source paper would falsify the claim of faithful reproduction.
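
A hedged sketch of such a check, assuming the current `transformers` and `datasets` packages and a publicly hosted SST-2 checkpoint; the checkpoint name and the reference figure below are assumptions for illustration, not numbers taken from the paper:

```python
# Illustrative reproduction check on GLUE SST-2 (not from the paper).
# Assumes `transformers`, `datasets`, and a backend such as `torch` are installed.
from datasets import load_dataset
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")
val = load_dataset("glue", "sst2", split="validation")

# This checkpoint labels outputs "NEGATIVE"/"POSITIVE"; GLUE encodes them as 0/1.
label_to_id = {"NEGATIVE": 0, "POSITIVE": 1}
preds = [label_to_id[p["label"]] for p in clf(val["sentence"], truncation=True)]
accuracy = sum(int(p == y) for p, y in zip(preds, val["label"])) / len(preds)

PUBLISHED_ACCURACY = 0.91  # placeholder for the number reported in the source paper
print(f"library checkpoint: {accuracy:.3f} vs. published: {PUBLISHED_ACCURACY:.3f}")
```

If a checkpoint loaded this way falls clearly short of the score reported in its source paper, the faithful-reproduction premise fails; if it matches, the premise survives that test.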

read the original abstract

Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript describes the Hugging Face Transformers library, an open-source Python package that implements a range of state-of-the-art Transformer architectures for natural language processing under a single, consistent API. It is backed by a curated collection of community-contributed pretrained models and is positioned as extensible for researchers, simple for practitioners, and robust for industrial deployment. The library is hosted at https://github.com/huggingface/transformers.

Significance. If the described implementations and pretrained weights are faithful to the original papers, the work is significant because it lowers the barrier to using high-capacity Transformer models, promotes reproducibility through open weights and code, and accelerates both research and deployment in NLP. The emphasis on a unified API and community contributions is a concrete strength that directly supports the paper's stated goals.

minor comments (1)
  1. [Abstract] The phrase 'state-of-the art' is missing a hyphen and should read 'state-of-the-art'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, recognition of the library's role in lowering barriers to Transformer models, and recommendation to accept. We appreciate the emphasis on the unified API and community contributions as key strengths.

Circularity Check

0 steps flagged

No significant circularity; factual software documentation

full rationale

The paper is an announcement and documentation of the Hugging Face Transformers open-source library. It describes goals, design principles, and availability of a software package with pretrained models under a unified API. No mathematical derivations, equations, fitted parameters, predictions of new quantities, or self-referential claims appear. The central claim is the existence and features of publicly available code, externally verifiable via the GitHub URL and community contributions. No load-bearing steps reduce to inputs by construction, and the document contains no self-citation chains or uniqueness theorems invoked to justify internal results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library announcement paper containing no mathematical derivations, fitted constants, or postulated physical entities.

pith-pipeline@v0.9.0 · 5514 in / 991 out tokens · 27196 ms · 2026-05-11T14:49:06.688795+00:00 · methodology


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  3. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  4. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  5. TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

    cs.CL 2026-05 unverdicted novelty 7.0

    TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

  6. EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

    cs.CL 2026-05 unverdicted novelty 7.0

    EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

  7. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  8. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  9. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  10. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  11. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  12. VertAX: a differentiable vertex model for learning epithelial tissue mechanics

    cs.LG 2026-04 unverdicted novelty 7.0

    VertAX supplies a differentiable JAX implementation of vertex models for confluent epithelia that enables forward simulation, mechanical parameter inference, and inverse design of tissue-scale behaviors.

  13. Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual attention in MLLMs shows inertia that hinders cognitive inference on object relations, addressed by a training-free Inertia-aware Visual Excitation method that selects dynamically emerging tokens and applies an...

  14. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  15. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  16. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  17. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  18. Large Spectrum Models (LSMs): Decoder-Only Transformer-Powered Spectrum Activity Forecasting via Tokenized RF Data

    cs.NI 2026-05 unverdicted novelty 6.0

    Decoder-only transformers trained on tokenized RF spectrum data from 22 TB of measurements achieve 3.25 dB RMSE in spectrum activity forecasting across 33 bands.

  19. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  20. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  21. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  22. BAMI: Training-Free Bias Mitigation in GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

  23. Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    Scaling pretrained representations improves label-free OOD detection on frozen backbones, causing performance gaps between global and local detectors to vanish across vision and language tasks.

  24. On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

    cs.CR 2026-05 conditional novelty 6.0

    An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

  25. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  26. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  27. R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

    cs.CV 2026-04 conditional novelty 6.0

    R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.

  28. RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical predic...

  29. Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in ex...

  30. SeLaR: Selective Latent Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

  31. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  32. Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge

    cs.DC 2026-04 unverdicted novelty 6.0

    ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, imp...

  33. Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

    cs.LG 2026-04 unverdicted novelty 6.0

    LLM warm-starts for bandits remain better than cold-starts up to roughly 30% random label noise but increase regret under systematic misalignment, with a derived sufficient condition on prior error that predicts when ...

  34. MemFactory: Unified Inference & Training Framework for Agent Memory

    cs.CL 2026-03 unverdicted novelty 6.0

    MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

  35. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  36. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  37. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  38. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    cs.CL 2023-03 unverdicted novelty 6.0

    AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.

  39. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    cs.CV 2023-03 accept novelty 6.0

    Grounding DINO fuses language and vision via feature enhancer, language-guided query selection, and cross-modality decoder in a DINO backbone, achieving 52.5 AP zero-shot on COCO and a new record of 26.1 AP mean on ODinW.

  40. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  41. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  42. EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

    cs.CL 2026-05 unverdicted novelty 5.0

    EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.

  43. GiVA: Gradient-Informed Bases for Vector-Based Adaptation

    cs.CL 2026-04 unverdicted novelty 5.0

    GiVA uses gradients to initialize vector adapters so they match LoRA performance at eight times lower rank while keeping extreme parameter efficiency.

  44. Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts

    cs.SE 2026-04 conditional novelty 5.0

    STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.

  45. Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

    cs.CV 2026-04 unverdicted novelty 5.0

    A latent diffusion model conditioned on line drawings estimates dense depth to reconstruct 3D wireframes, reporting 5.3% average depth error after training on over one million pairs.

  46. FedSpy-LLM: Towards Scalable and Generalizable Data Reconstruction Attacks from Gradients on LLMs

    cs.CR 2026-04 unverdicted novelty 5.0

    FedSpy-LLM uses gradient decomposition and iterative alignment to reconstruct larger batches and longer sequences of training data from LLM gradients in federated settings, including with PEFT methods.

  47. OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

    cs.CR 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 47 Pith papers · 15 internal anchors
