pith. machine review for the scientific record.

arxiv: 1906.08237 · v2 · pith:L7AEHQT4new · submitted 2019-06-19 · 💻 cs.CL · cs.LG

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Pith reviewed 2026-05-18 01:24 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords XLNet · autoregressive pretraining · permutation language modeling · bidirectional context · Transformer-XL · language understanding · BERT comparison

The pith

XLNet learns bidirectional context by maximizing expected likelihood over all permutations of factorization order.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes XLNet to fix two problems in existing pretraining methods. Autoregressive language models condition on context in only one direction (typically left to right), while masked models like BERT ignore dependencies among the masked tokens and, because the [MASK] symbol never appears at fine-tuning time, create a pretrain-finetune mismatch. XLNet keeps the autoregressive form but trains by maximizing the expected log-likelihood over all possible factorization orders of the tokens in a sequence; the tokens keep their original positions, and only the order in which they are predicted changes. This yields bidirectional context without masks. The resulting models outperform BERT on 20 downstream tasks under comparable settings.
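As a toy illustration of why ranging over factorization orders gives bidirectional context, the Python sketch below enumerates every order of a five-token sequence and collects the context sets that one target token is conditioned on. The example sentence and the helper name contexts_seen are assumptions made for illustration; nothing here is taken from the authors' code.

```python
from itertools import permutations

def contexts_seen(tokens, target_index):
    """Collect every context set the target token is conditioned on,
    across all factorization orders of the sequence."""
    seen = set()
    for order in permutations(range(len(tokens))):
        step = order.index(target_index)
        # Positions that come earlier in the factorization order form the context.
        seen.add(frozenset(order[:step]))
    return seen

tokens = ["New", "York", "is", "a", "city"]  # illustrative sentence
ctx = contexts_seen(tokens, target_index=1)  # target token: "York"

print(len(ctx))                        # 16: every subset of the other four positions
print(frozenset({0}) in ctx)           # True: left context only ("New")
print(frozenset({2, 3, 4}) in ctx)     # True: right context only ("is a city")
```

A single left-to-right order would only ever give "York" the context {"New"}; across all orders the same position is predicted from left-only, right-only, and mixed contexts.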

Core claim

XLNet is a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

What carries the argument

Permutation language modeling objective that maximizes the expected log-likelihood of a sequence over all possible permutations of its factorization order.
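Written out, the objective can be stated as below. The notation is assumed here for illustration rather than quoted from this page: x is a sequence of length T, Z_T is the set of all permutations of the index sequence [1, ..., T], z_t is the t-th element of a permutation z, z_{<t} denotes its first t-1 elements, and θ denotes the model parameters.

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Choosing z to be the identity permutation for every sequence recovers the ordinary left-to-right language modeling objective.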

Load-bearing premise

Averaging the likelihood over all permutations of the factorization order teaches effective bidirectional context without new optimization problems or sampling biases.

What would settle it

Train an XLNet variant that uses only one fixed factorization order instead of averaging over permutations and measure whether its advantage over BERT disappears on the reported tasks.

original abstract

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes XLNet, a generalized autoregressive pretraining method that maximizes the expected likelihood over all permutations of the factorization order to enable bidirectional context modeling. It addresses two weaknesses of BERT, the neglected dependency between masked positions and the pretrain-finetune discrepancy, while integrating Transformer-XL's segment recurrence and relative positional encodings. Under comparable settings, XLNet is shown to outperform BERT on 20 downstream tasks spanning question answering, natural language inference, sentiment analysis, and document ranking.

Significance. If the central empirical claims hold after verification of controls, the work would be significant as a new pretraining paradigm that retains autoregressive tractability while achieving bidirectional context. The explicit permutation objective and its integration with established components like Transformer-XL provide a clear alternative to denoising autoencoders, with potential for broader adoption in language model pretraining.

major comments (2)
  1. [§3.2] §3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.
  2. [§4.1] §4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.
minor comments (2)
  1. [Table 1] Table 1 and associated text: clarify whether all baselines use identical data splits and hyperparameter budgets; a single additional column reporting matched hyperparameter counts would strengthen comparability.
  2. [Figure 2] Figure 2: the legend for the two-stream attention diagram is underspecified; explicitly label the query and content streams in the caption.
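To make the two streams raised in major comment 1 and Figure 2 concrete, here is a minimal Python sketch of the attention masks as the XLNet design is usually described: the content stream at a position may attend to itself and to positions earlier in the factorization order, while the query stream may attend only to strictly earlier positions. The function name two_stream_masks and the example order are hypothetical, and the sketch leaves out Transformer-XL memory and relative positional encodings.

```python
def two_stream_masks(order):
    """Return (content_mask, query_mask) for one factorization order.

    order[t] is the sequence position predicted at step t.
    mask[i][j] is True when position i may attend to position j.
    """
    n = len(order)
    # rank[p] = step at which position p appears in the factorization order.
    rank = [0] * n
    for step, position in enumerate(order):
        rank[position] = step
    # Content stream: a position sees itself and everything earlier in the order.
    content_mask = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: strictly earlier positions only, so the content of the
    # token being predicted is never visible to its own query.
    query_mask = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content_mask, query_mask

order = [2, 0, 3, 1]  # hypothetical order: predict position 2, then 0, 3, 1
content_mask, query_mask = two_stream_masks(order)
for row in query_mask:
    print([int(v) for v in row])
```

The only difference between the two masks is the diagonal, which is the property a targeted ablation of a "content-leaking" variant would remove.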

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below with clarifications and indicate where we will revise the manuscript to incorporate additional explanations and analyses.

point-by-point responses
  1. Referee: [§3.2] §3.2 (Two-stream self-attention): The mechanism is presented as preventing position-content leakage, yet the paper provides no formal argument or targeted ablation demonstrating that the query stream fully isolates content from positional information across all sampled permutations. This is load-bearing for the claim that gains derive from the permutation objective rather than architectural side effects.

    Authors: We appreciate the referee highlighting this aspect of the two-stream self-attention. The query stream is constructed so that its representation at position z_t attends exclusively to the content representations of the preceding positions in the permutation (z_1 to z_{t-1}), using the attention mask defined in Equations (3)–(4) together with relative positional encodings. This ensures the query never accesses the content embedding of the token being predicted, thereby preventing leakage while still incorporating order information. Although the original manuscript did not supply a standalone formal isolation argument or a dedicated ablation, the separation follows directly from the stream-specific parameterization and masking. In the revision we will expand Section 3.2 with a concise derivation showing the conditional independence property and add a targeted ablation (in the appendix) that compares the full two-stream model against a content-leaking variant on a representative task. revision: yes

  2. Referee: [§4.1] §4.1 (Experimental setup): The number of permutations sampled per sequence during training is described only at a high level; without an ablation relating sample count to downstream performance, it remains unclear whether the Monte Carlo approximation is sufficient to realize the full expected-likelihood objective or whether reported gains partly reflect sampling bias or the two-stream/Transformer-XL additions.

    Authors: We agree that the description of the Monte Carlo approximation in §4.1 is high-level. In our implementation we draw a fixed number of permutations (K = 6) per sequence to approximate the expectation; this choice is stated in the experimental details but without sensitivity analysis. To demonstrate that the approximation is adequate and that gains are not artifacts of sampling bias or the auxiliary architectural components, we will add an ablation that varies K (1, 3, 6, 12) while holding the two-stream attention and Transformer-XL recurrence fixed, reporting downstream performance on a subset of tasks. The results will be included in the revised experimental section or appendix. revision: yes
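The sampling question in the second exchange can be illustrated with a small Python sketch that compares a Monte Carlo estimate of the expected log-likelihood against the exact average over all factorization orders. Everything here is an assumption for illustration: toy_log_prob is a hypothetical stand-in for a trained model's conditional log-probability, the sequence length is 5 so that all 120 orders can be enumerated, and the sample sizes simply reuse the values named in the proposed ablation. It shows how the estimator behaves, not how downstream performance would respond.

```python
import random
from itertools import permutations

def toy_log_prob(position, context, length):
    """Hypothetical stand-in for log p(x_pos | context): the score improves
    as more immediate neighbours of the target are already in the context."""
    neighbours = {p for p in (position - 1, position + 1) if 0 <= p < length}
    visible = len(neighbours & context)
    return -1.0 / (2 ** visible)

def order_log_likelihood(order):
    """Sum of toy conditional log-probabilities along one factorization order."""
    total, context = 0.0, set()
    for position in order:
        total += toy_log_prob(position, context, len(order))
        context.add(position)
    return total

def exact_expectation(length):
    """Average over every factorization order (feasible only for tiny lengths)."""
    orders = list(permutations(range(length)))
    return sum(order_log_likelihood(o) for o in orders) / len(orders)

def sampled_expectation(length, num_orders, seed=0):
    """Monte Carlo estimate from uniformly sampled factorization orders."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_orders):
        order = list(range(length))
        rng.shuffle(order)  # uniform random factorization order
        total += order_log_likelihood(order)
    return total / num_orders

exact = exact_expectation(5)
for k in (1, 3, 6, 12, 5000):
    # Print the estimation error for each number of sampled orders.
    print(k, round(sampled_expectation(5, k) - exact, 4))
```

Because each order is drawn uniformly, the estimate is unbiased for any number of samples; what changes with the sample count is its variance.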

Circularity Check

0 steps flagged

Minor self-citation to Transformer-XL architecture; core permutation LM objective is independently defined and evaluated on external tasks

full rationale

The paper defines its central contribution as maximizing the expected log-likelihood over all permutations of the factorization order, an explicit new objective that is not constructed from or equivalent to any fitted parameter or prior result within the paper. Performance is measured against external benchmarks (20 tasks including QA and NLI) rather than reducing to internal fits. Integration of Transformer-XL ideas is cited for the autoregressive backbone and memory mechanism, but this is an architectural choice whose contribution is separable from the permutation objective and does not bear the load of the bidirectional-context claim. No self-definitional loop, fitted-input-as-prediction, or uniqueness theorem imported from the same authors appears in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard transformer attention mechanism and the assumption that uniform sampling over permutations yields an unbiased estimator of bidirectional dependencies; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • permutation sampling distribution
    The paper must choose how to sample or enumerate permutations; this choice is a modeling decision that affects training dynamics.
axioms (1)
  • domain assumption: The attention mechanism in Transformer-XL can be applied to arbitrary factorization orders without architectural change.
    The integration of Transformer-XL is invoked to handle long-range dependencies under permutation orders.

pith-pipeline@v0.9.0 · 5701 in / 1219 out tokens · 31002 ms · 2026-05-18T01:24:35.604794+00:00 · methodology


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    cs.CL 2019-08 unverdicted novelty 8.0

    Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...

  3. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  4. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  5. GraphCodeBERT: Pre-training Code Representations with Data Flow

    cs.SE 2020-09 accept novelty 7.0

    GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.

  6. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  8. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  9. ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

  10. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  11. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  12. Scaling Laws for Transfer

    cs.LG 2021-02 unverdicted novelty 6.0

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  13. CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    cs.CL 2020-02 unverdicted novelty 6.0

    CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.

  14. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    cs.CL 2020-02 accept novelty 6.0

    Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

  15. Compressive Transformers for Long-Range Sequence Modelling

    cs.LG 2019-11 unverdicted novelty 6.0

    Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.

  16. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  17. RoBERTa: A Robustly Optimized BERT Pretraining Approach

    cs.CL 2019-07 accept novelty 5.0

    With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 17 Pith papers · 16 internal anchors

  1. [1]

    Character-Level Language Modeling with Deeper Self-Attention

    Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018

  2. [2]

    Bam! born-again multi-task networks for natural language understanding

    Anonymous. Bam! born-again multi-task networks for natural language understanding. anonymous preprint under review, 2018

  3. [3]

    Adaptive Input Representations for Neural Language Modeling

    Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018

  4. [4]

    Modeling high-dimensional discrete data with multi-layer neural networks

    Yoshua Bengio and Samy Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406, 2000

  5. [5]

    Clueweb09 data set, 2009

    Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. Clueweb09 data set, 2009

  6. [6]

    Common crawl

    Common Crawl. Common crawl. URL: http://commoncrawl.org, 2019

  7. [7]

    Semi-supervised sequence learning

    Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087, 2015

  8. [8]

    Convolutional neural networks for soft-matching n-grams in ad-hoc search

    Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 126–134. ACM, 2018

  9. [9]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  10. [10]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  11. [11]

    MaskGAN: Better Text Generation via Filling in the______

    William Fedus, Ian Goodfellow, and Andrew M Dai. Maskgan: better text generation via filling in the_. arXiv preprint arXiv:1801.07736, 2018

  12. [12]

    Made: Masked autoencoder for distribution estimation

    Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015

  13. [13]

    A deep relevance matching model for ad-hoc retrieval

    Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM, 2016

  14. [14]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018

  15. [15]

    Deep pyramid convolutional neural networks for text categorization

    Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570, 2017

  16. [16]

    A surprisingly robust trick for winograd schema challenge

    Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for winograd schema challenge. arXiv preprint arXiv:1905.06290, 2019

  17. [17]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

    Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

  18. [18]

    RACE: Large-scale ReAding Comprehension Dataset From Examinations

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017

  19. [19]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019

  20. [20]

    Multi-Task Deep Neural Networks for Natural Language Understanding

    Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019

  21. [21]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  22. [22]

    Learned in translation: Contextualized word vectors

    Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305, 2017

  23. [23]

    Adversarial training methods for semi-supervised text classification

    Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016

  24. [24]

    Pixel Recurrent Neural Networks

    Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016

  25. [25]

    Improving question answering with external knowledge

    Xiaoman Pan, Kai Sun, Dian Yu, Heng Ji, and Dong Yu. Improving question answering with external knowledge. arXiv preprint arXiv:1902.00993, 2019

  26. [26]

    English gigaword fifth edition, linguistic data consortium

    Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English gigaword fifth edition. Technical report, Linguistic Data Consortium, Philadelphia, 2011

  27. [27]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018

  28. [28]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018

  29. [29]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018

  30. [30]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  31. [31]

    Revisiting lstm networks for semi-supervised text classification via mixed objective function

    Devendra Singh Sachan, Manzil Zaheer, and Ruslan Salakhutdinov. Revisiting lstm networks for semi-supervised text classification via mixed objective function. 2018

  32. [32]

    Neural autoregressive distribution estimation

    Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016

  33. [33]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017

  34. [34]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR, 2019

  35. [35]

    Unsupervised data augmentation

    Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019

  36. [36]

    End-to-end neural ad-hoc ranking with kernel pooling

    Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, pages 55–64. ACM, 2017

  37. [37]

    Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

    Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017

  38. [38]

    Dual co-matching network for multi-choice reading comprehension

    Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381, 2019

  39. [39]

    Character-level convolutional networks for text classification

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015

  40. [40]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27, 2015

  41. [41]

    Thom Yorke is the singer of Radiohead

    is only able to cover the dependency (x = York, U = {New}) but not (x = New, U = {York}). XLNet, on the other hand, is able to cover both in expectation over all factorization orders. Such a limitation of AR language modeling can be critical in real-world applications. For example, consider a span extraction question answering task with the context “Thom ...