Recognition: 1 theorem link
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pith reviewed 2026-05-09 01:10 UTC · model claude-opus-4-7
The pith
A careful retraining of BERT — longer, on more data, with dynamic masking and no next-sentence loss — matches or beats every model published after it on GLUE, SQuAD, and RACE.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that BERT, as originally released, was undertrained, and that a careful replication holding architecture and objective fixed — while training longer, on roughly ten times more text, with larger batches, dynamic masking, no next-sentence-prediction loss, and a byte-level BPE vocabulary — matches or surpasses every post-BERT model published up to that point on GLUE, SQuAD, and RACE. The implication the authors press is that gains attributed to newer pretraining objectives or architectures may instead be explained by training budget and data scale.
What carries the argument
A controlled ablation over BERT's training recipe rather than its architecture: (1) dynamic masking instead of a fixed precomputed mask, (2) packing full sentences across document boundaries and dropping the next-sentence-prediction auxiliary loss, (3) batch sizes of 8K sequences with retuned learning rate and Adam β₂=0.98, (4) a 50K byte-level BPE vocabulary with no language-specific preprocessing, and (5) scaling pretraining data to 160GB (BookCorpus+Wikipedia plus CC-News, OpenWebText, and Stories) and pretraining for up to 500K steps. The architecture and the masked-language-modeling objective are held fixed at BERT_LARGE.
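To make the first item concrete, a minimal sketch of dynamic masking versus BERT's precomputed masks, in Python with made-up token ids and hypothetical names (MASK_ID, VOCAB_SIZE, apply_mlm_mask); this is illustrative, not the paper's fairseq implementation.

```python
# Minimal sketch of dynamic vs. static masking, assuming toy integer token ids.
# MASK_ID, VOCAB_SIZE and MASK_PROB are illustrative constants, not the paper's code.
import random

MASK_ID = 4          # hypothetical [MASK] token id
VOCAB_SIZE = 50_000  # hypothetical vocabulary size
MASK_PROB = 0.15     # fraction of positions selected for prediction, as in BERT

def apply_mlm_mask(token_ids, rng):
    """Return (inputs, labels) with BERT's 80/10/10 corruption of selected positions."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = excluded from the loss in this sketch
    for i, tok in enumerate(token_ids):
        if rng.random() < MASK_PROB:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

# Static masking (original BERT): masks are precomputed once and reused across epochs.
# Dynamic masking (RoBERTa): resample every time a sequence is served.
def dynamic_batches(sequences, epochs, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        for seq in sequences:
            yield apply_mlm_mask(seq, rng)  # fresh mask on every pass
```

In the original setup each training sequence was duplicated with ten precomputed masks over roughly forty epochs, so any given mask was reused about four times; resampling on every pass removes that reuse, which matters more as training length grows.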
If this is right
- Reported gains from newer pretraining objectives over BERT should be re-examined against compute-matched baselines, since training budget alone closes most of the gap.
Where Pith is reading between the lines
- If most apparent progress over BERT is explained by training budget, then benchmark leaderboards in this period are partly tracking compute spend rather than modeling ideas — a methodological caution that extends well beyond NLP.
Load-bearing premise
That fixing the architecture and objective while changing data, steps, batch size, and tokenizer constitutes a fair attribution of credit — the comparison with competing methods does not retune those methods under matched compute, so the claim that masked language modeling is "competitive" with newer objectives rests on the assumption that the competitors would not pull ahead again under the same scaling treatment.
What would settle it
Retrain a competing model (e.g. XLNet or a permutation/span-based variant) under matched data (160GB), matched batch size (8K), and matched step count (500K) using the same byte-level BPE and dynamic masking, and compare GLUE/SQuAD/RACE numbers head-to-head. If the competitor still beats RoBERTa by a clear margin under matched compute, the claim that masked language modeling is competitive with the alternatives fails.
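A minimal sketch of what that settling experiment would pin down, assuming hypothetical names (PretrainBudget, run) and using the configuration numbers stated above; this is a specification of the budget to be matched, not real training code.

```python
# Hedged sketch of a matched-budget head-to-head; the numbers mirror RoBERTa's
# largest configuration, and run() is a stand-in, not an actual trainer.
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainBudget:
    corpus_gb: int = 160        # BOOKS+WIKI + CC-NEWS + OPENWEBTEXT + STORIES
    batch_size: int = 8_192     # sequences per optimizer step
    steps: int = 500_000
    seq_len: int = 512
    vocab: str = "50K byte-level BPE"
    masking: str = "dynamic"

    @property
    def sequence_passes(self) -> int:
        # shared budget in sequence passes; a token-level accounting would also
        # need to reconcile the permutation objective's per-step token count
        return self.batch_size * self.steps

def run(objective: str, budget: PretrainBudget) -> dict:
    # stand-in for pretraining + finetuning + GLUE/SQuAD/RACE evaluation
    return {"objective": objective, "sequence_passes": budget.sequence_passes}

def matched_comparison() -> dict:
    budget = PretrainBudget()
    return {obj: run(obj, budget) for obj in ("masked_lm", "permutation_lm")}
```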
read the original abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a replication and ablation study of BERT pretraining. The authors reimplement BERT in fairseq, sweep four design axes — dynamic vs. static masking (§4.1), input format and the NSP loss (§4.2), batch size (§4.3), and byte-level BPE (§4.4) — and combine the favorable settings with substantially more data (160GB across BOOKS+WIKI, CC-NEWS, OPENWEBTEXT, STORIES) and more optimizer steps (up to 500K at 8K batch). The resulting model, RoBERTa, is reported to match or exceed all post-BERT published systems on GLUE (Table 5), SQuAD v1.1/v2.0 (Table 6), and RACE (Table 7), without multi-task finetuning on GLUE or external QA data on SQuAD. The central scientific claim is that BERT was significantly undertrained and that, with the right training recipe, the MLM objective is competitive with subsequently proposed alternatives such as permutation LM (XLNet).
Significance. If the result holds, the paper materially reshapes how the community attributes credit for the gains reported in 2018–2019: a sizable fraction of post-BERT improvement is attributable to data scale, batch size, and training length rather than to new objectives or architectures. This is a useful corrective and a high-value contribution to a literature where ablations against private data and undisclosed compute budgets have made comparisons unreliable. Concrete strengths are: (i) the ablations in Tables 1–3 are clean and use medians over five seeds; (ii) the introduction of CC-NEWS partially closes the public-data gap with concurrent work; (iii) models, code, and a documented hyperparameter recipe (Tables 9–10) are released, enabling third-party replication. The released artifact has in fact become a widely used baseline, which is itself evidence of the practical claim. The paper is appropriately modest in footnote 2 about the limits of its comparisons.
major comments (4)
- [§5, Table 4] The headline claim that MLM is 'competitive with' permutation LM is not cleanly supported by the most controlled row of Table 4. At matched BOOKS+WIKI data, XLNet_LARGE reports 94.0/87.8 on SQuAD 1.1/2.0 and 88.4 on MNLI-m, while RoBERTa-BOOKS+WIKI reports 93.6/87.3 and 89.0 — RoBERTa loses on SQuAD 2.0 and is within noise on MNLI. RoBERTa only clearly surpasses XLNet after adding ~10× more text and 5× more updates (500K), at which point XLNet itself is also no longer at its matched-data setting. Please either (a) restate the conclusion as 'MLM is competitive once given comparable or larger training budget,' or (b) report a compute- and data-matched comparison (same corpus, same token count seen, same batch and step budget). The current phrasing in §1 and §7 overstates what Table 4 shows.
- [§4.4 / Table 4] The switch to a 50K byte-level BPE adds approximately 20M parameters to BERT_LARGE (the paper's own estimate in §4.4). This confounds the BERT_LARGE → RoBERTa-BOOKS+WIKI comparison in Table 4 (90.9/81.8 → 93.6/87.3 on SQuAD), since part of the gap may reflect added embedding capacity rather than 'BERT was undertrained.' §4.4 states 'early experiments revealed only slight differences' but provides no table. A small ablation isolating 30K char-BPE vs 50K byte-BPE at otherwise matched settings (one row would suffice) would close this gap and is important because the BPE choice is one of the four pillars of the recipe. (The parameter delta is worked out in the sketch after this list.)
- [§4.3, Table 3] The large-batch comparison varies batch size, step count, and learning rate jointly while reporting only perplexity and two GLUE dev metrics. The 2K-batch/125K-step setting outperforms both 256/1M and 8K/31K on perplexity (3.68 vs 3.99 vs 3.77), yet the paper adopts 8K for downstream experiments citing parallelization. Please clarify why 2K is not the preferred choice on the evidence presented, or report SQuAD/RACE numbers for the three settings so the choice is grounded in end-task performance rather than engineering convenience. (The three settings' sequence budgets are worked out in the sketch after this list.)
- [§4.2, Table 2] The conclusion that removing NSP 'matches or slightly improves' downstream task performance is drawn from differences that are often within plausible seed variance (e.g., FULL-SENTENCES 84.7 vs SEGMENT-PAIR+NSP 84.0 on MNLI-m; 92.5 vs 92.9 on SST-2). Reported numbers are medians over five seeds but no spread is given. Please report standard deviations or min/max across seeds for Table 2 so the reader can judge whether the NSP-removal effect exceeds noise; this matters because removing NSP is one of the four headline modifications. (The per-seed spread computation being requested is sketched after this list.)
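The arithmetic behind major comments #2-#4 is small enough to check directly. A minimal sketch, assuming the standard published vocabulary sizes and hidden width (BERT: 30,522 WordPiece types; RoBERTa: 50,265 byte-level BPE types; hidden size 1024 at LARGE scale) and using placeholder seed scores for the spread computation:

```python
# Back-of-envelope checks for major comments #2-#4. Vocabulary sizes and hidden
# width are the standard published values; the seed scores are placeholders.
import statistics

# (#2) extra embedding parameters from the larger vocabulary; with tied
# input/output embeddings the table is counted once
hidden = 1024
extra_params = (50_265 - 30_522) * hidden
print(f"extra embedding parameters: ~{extra_params / 1e6:.1f}M")  # ~20.2M

# (#3) total sequence passes for the three Table 3 settings are roughly matched,
# so the comparison varies batch size at a near-constant data budget
for bsz, steps in [(256, 1_000_000), (2_048, 125_000), (8_192, 31_000)]:
    print(f"batch {bsz:>5} x {steps:>9,} steps = {bsz * steps / 1e6:5.0f}M sequence passes")

# (#4) the spread being requested for Table 2: per-cell std and min/max over the
# five finetuning seeds, reported alongside the median
seed_scores = [84.5, 84.9, 84.7, 84.4, 85.0]  # illustrative placeholders, not from the paper
print(f"median {statistics.median(seed_scores):.1f}  "
      f"std {statistics.stdev(seed_scores):.2f}  "
      f"min/max {min(seed_scores)}/{max(seed_scores)}")
```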
minor comments (6)
- [§3.2] CC-NEWS filtering is described in one sentence ('76GB after filtering'). A short description of the filter (language ID, dedup, boilerplate removal) would help replication, especially since the dataset is presented as a contribution.
- [Table 4] The 'data' column lists 13GB for XLNet and 16GB for RoBERTa under 'BOOKS+WIKI'; footnote 3 attributes this to Wikipedia cleaning differences. Worth restating in the Table 4 caption so a casual reader does not mistake this for a data-budget mismatch in RoBERTa's favor at the matched row.
- [§5.1, WNLI] The WNLI procedure (margin ranking with spaCy-extracted candidates, SuperGLUE reformatting) is non-standard and excludes negative training examples. Given the 91.3 dev / 89.0 test number contributes to the average, a sentence explicitly flagging that this score is not directly comparable to other systems' WNLI numbers would be appropriate.
- [§4.1, Table 1] The dynamic-vs-static gap is small (e.g., 78.7 vs 78.3 SQuAD 2.0; 84.0 vs 84.3 MNLI). Calling dynamic masking 'comparable or slightly better' is fair, but the abstract and §1 list dynamic masking as one of four key improvements; consider softening the framing to match Table 1.
- [Typography] Several places contain OCR-like artifacts in the submitted PDF ('Y ang', 'Y ou', 'V aswani', 'B OOK CORPUS'); please verify font/encoding in the camera-ready.
- [§5] The Appendix hyperparameters (Tables 9–10) are useful; consider also reporting the total wall-clock time and GPU-hours per pretraining run so future replications can budget appropriately. The text mentions '1024 V100 GPUs for approximately one day' but only for one configuration. (A rough conversion to GPU-hours is sketched after this list.)
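The one figure the paper does give supports a rough budget estimate. A minimal sketch, assuming the quoted one-day figure refers to the 100K-step configuration and that cost scales roughly linearly with step count (both assumptions, not statements from the paper):

```python
# Rough GPU-hour arithmetic behind the last minor comment; the extrapolation to
# 500K steps is a linear-scaling assumption for illustration only.
gpus, hours = 1024, 24
gpu_hours_100k = gpus * hours
print(f"~{gpu_hours_100k:,} GPU-hours for the 100K-step run")            # ~24,576
print(f"~{gpu_hours_100k * 5:,} GPU-hours if 500K steps scaled linearly")
```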
Simulated Author's Rebuttal
We thank the referee for the careful and substantive report, and in particular for distinguishing the empirical contribution (the recipe and the released artifact) from the rhetorical framing of the central claim. We accept the four major points essentially as stated. The referee is correct that (i) our 'competitive with permutation LM' claim is properly conditioned on training budget rather than asserted at matched data, (ii) the 50K byte-level BPE adds parameters and partially confounds the BERT_LARGE -> RoBERTa comparison in Table 4, (iii) the choice of 8K batch over 2K in Table 3 is motivated by parallelization rather than by a clean end-task win, and (iv) several of the NSP-removal contrasts in Table 2 are within seed noise and the prose should reflect this. We will revise §1, §4.2, §4.3, §4.4, §5 and §7 accordingly, add per-seed spread to Table 2, add the BASE-scale BPE comparison and SQuAD numbers for the Table 3 settings to the appendix, and explicitly bound the headline claim. Two items — a strictly token-matched re-run against XLNet, and a LARGE-scale char-BPE vs byte-BPE ablation — we cannot produce within the revision window; we list these as standing objections and will disclose them rather than overclaim.
read point-by-point responses
-
Referee: Major #1 [§5, Table 4]: Headline claim that MLM is 'competitive with' permutation LM is not cleanly supported at matched BOOKS+WIKI. RoBERTa loses on SQuAD 2.0 (87.3 vs 87.8) and is within noise on MNLI; the win only emerges after ~10x data and 5x steps, by which point XLNet is also off its matched-data setting. Restate to 'competitive once given comparable or larger budget,' or run a strictly compute- and data-matched comparison.
Authors: The referee is right that our matched-data row is the appropriate basis for the strongest version of the claim, and that on that row RoBERTa is essentially tied with (and slightly behind on SQuAD 2.0) XLNet_LARGE rather than dominating it. Our intended claim was the weaker one the referee articulates: that MLM remains competitive with permutation LM when given a comparable or larger training budget, and that a substantial portion of the post-BERT gains attributed to new objectives can be recovered by training scale alone. We will revise §1 and §7 to state this more precisely, replacing 'match or exceed every model published after it' in unqualified form with language that explicitly conditions on training budget. We will also add a sentence to §5 noting that at matched BOOKS+WIKI / 1M-equivalent budget, RoBERTa and XLNet_LARGE are within ~0.5 points on SQuAD/MNLI, and that the larger-budget rows of Table 4 are not budget-matched against XLNet's own larger-budget row (94.5/88.8, 89.8 with 126GB / 500K / batch 2K). A strictly token-matched re-run against XLNet is unfortunately outside what we can produce within the revision window — XLNet's permutation training has a different effective tokens-per-step accounting, and we do not have access to their exact data composition — but we will state this limitation explicitly rather than paper over it. revision: yes
-
Referee: Major #2 [§4.4 / Table 4]: The 50K byte-level BPE adds ~20M parameters to BERT_LARGE, confounding the BERT_LARGE -> RoBERTa-BOOKS+WIKI comparison. §4.4 asserts 'only slight differences' but shows no table. Provide a one-row ablation isolating 30K char-BPE vs 50K byte-BPE at matched settings.
Authors: We agree this is a real confound and that §4.4's qualitative remark is not a substitute for a number. Our internal early experiments compared the two encodings at BERT_BASE scale with otherwise matched settings and did not show systematic gains for byte-level BPE (in fact slightly worse on some tasks, as noted), which is why we framed the choice as motivated by universality rather than accuracy. We will add a row to the appendix giving the head-to-head dev numbers we have at BASE scale, and we will explicitly flag in §4.4 and in the discussion of Table 4 that the ~20M-parameter increase at LARGE is a confound for the BERT_LARGE -> RoBERTa-BOOKS+WIKI delta, so that readers do not attribute the full 90.9 -> 93.6 SQuAD 1.1 gap to 'undertraining.' We do not have a fully matched 30K-char vs 50K-byte run at LARGE scale, and we will say so rather than overclaim. revision: partial
-
Referee: Major #3 [§4.3, Table 3]: Batch size, steps, and learning rate vary jointly; only ppl + two GLUE metrics are reported. 2K/125K beats 8K/31K on ppl but 8K is adopted citing parallelization. Justify on end-task performance or report SQuAD/RACE for the three settings.
Authors: The referee has correctly identified that Table 3 does not on its face justify 8K over 2K on accuracy grounds. The honest statement of our reasoning is engineering: at the scale of the §5 experiments (1024 V100s, 500K steps, 160GB), 8K batches were materially easier to keep utilization high under distributed data-parallel training, and the dev-set differences we observed between 2K and 8K at this controlled BASE-scale setup were small and did not consistently favor 2K on downstream tasks beyond what Table 3 shows. We will (i) explicitly state in §4.3 that the choice of 8K over 2K is driven by parallelization rather than by an accuracy advantage on Table 3, (ii) add SQuAD numbers for the three Table 3 settings to the appendix where we have them, and (iii) soften the implication that 8K is optimal on the evidence presented. We agree this is a fair correction. revision: yes
-
Referee: Major #4 [§4.2, Table 2]: NSP-removal effects are within plausible seed variance (e.g., FULL-SENTENCES 84.7 vs SEGMENT-PAIR+NSP 84.0 on MNLI-m; 92.5 vs 92.9 on SST-2). Report std / min-max across the five seeds so readers can judge whether the effect exceeds noise.
Authors: This is well taken. Our claim in §4.2 is deliberately phrased as 'matches or slightly improves' rather than 'improves,' precisely because for several of the cells the gap is within what we observe across seeds, and the stronger statement we make is the negative one — that retaining NSP does not help and that SENTENCE-PAIR (which forces short inputs) clearly hurts. We will add per-cell spread (std and min/max over the five seeds) to Table 2 in the revision, for both the NSP and the input-format rows, so the reader can see directly which contrasts are above seed noise (SEGMENT-PAIR vs SENTENCE-PAIR; SEGMENT-PAIR vs DOC-SENTENCES on SQuAD/RACE) and which are not (FULL-SENTENCES vs SEGMENT-PAIR+NSP on MNLI/SST-2). We will also adjust the prose in §4.2 and §7 so that the headline summary about NSP is 'removing NSP does not hurt, and removing it together with the SENTENCE-PAIR format helps,' rather than implying a uniform improvement. revision: yes
- A strictly token-, batch-, and step-matched head-to-head against XLNet (Major #1) is not feasible within the revision window: we lack access to XLNet's exact data composition and the permutation objective's per-step token accounting differs from MLM's. We will instead bound the claim and disclose the limitation, rather than produce a comparison we cannot run cleanly.
- We do not have a fully matched 30K char-BPE vs 50K byte-BPE ablation at LARGE scale (Major #2). We can add the BASE-scale comparison we did run, and we will flag the parameter-count confound at LARGE explicitly, but a LARGE-scale matched ablation is beyond the compute we can commit to this revision.
Circularity Check
No meaningful circularity: RoBERTa's claims are evaluated on external held-out benchmarks (GLUE leaderboard, SQuAD, RACE), not on quantities fitted by the authors.
full rationale
This is an empirical replication/ablation study of BERT pretraining. The central claims — that dynamic masking, removing NSP, larger batches, byte-level BPE, more data, and longer training each improve downstream performance, and that the resulting model matches/exceeds XLNet — are evaluated against external benchmarks (GLUE test via a third-party leaderboard, SQuAD 1.1/2.0, RACE) using metrics (F1, EM, accuracy) defined outside the paper. Hyperparameters are tuned on dev sets and reported on test sets; there is no instance of a fitted parameter being renamed as a prediction, no self-definitional loop, and no load-bearing self-citation that would constitute circularity in the technical sense. Comparisons to BERT and XLNet quote numbers from those papers (Devlin et al. 2019; Yang et al. 2019), which are independent prior work, not self-citations of the present authors used to forbid alternatives. The reader's and skeptic's critiques are real concerns but are about *fairness of attribution under non-matched compute/data budgets*, not about circularity. Specifically: (i) at matched BOOKS+WIKI, XLNet edges RoBERTa on SQuAD 2.0 (87.8 vs 87.3), so the "MLM ≈ permutation LM" claim leans on the extra-data/extra-steps rows; (ii) the byte-level BPE adds ~20M params with no isolated ablation ("early experiments revealed only slight differences" — Section 4.4 — but no table). These are confounds in causal attribution and belong under correctness/scope risk, not circularity. The derivation chain itself does not collapse to its inputs by construction. A score of 1 reflects only routine self-citation (Ott et al. 2018, 2019 for fairseq and large-batch NMT) which is methodological tooling, not load-bearing for the empirical claims.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
-
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
LLMs outperform single human raters at spotting relative weaknesses in L2 writing profiles on the ICNALE GRA dataset while humans are better at spotting strengths, using a self-referential intra-learner evaluation method.
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...
-
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and li...
-
Not all ANIMALs are equal: metaphorical framing through source domains and semantic frames
An NLP framework shows that liberals and conservatives use different semantic frames within the same metaphorical source domains when discussing immigration, while also uncovering nuanced frames in climate change coverage.
-
A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Low information density is identified as the root cause of NER failures on user-generated content, with the Window-Aware Optimization Module delivering up to 4.5% F1 gains and new SOTA on WNUT2017.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
SecureRouter: Encrypted Routing for Efficient Secure Inference
SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on ...
-
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
METRO induces both short-term actions and long-term planning from expert transcripts into a Strategy Forest, outperforming prior methods by 9-10% on two non-collaborative dialogue benchmarks.
-
SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
-
BMdataset: A Musicologically Curated LilyPond Dataset
A musicologically curated LilyPond dataset of 393 Baroque scores enables LilyBERT to outperform large-scale pre-training on composer and style classification when used alone for fine-tuning.
-
LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.
-
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
TransFIR enables reasoning on temporal knowledge graphs for emerging entities by clustering them into semantic groups and borrowing interaction histories from similar known entities, yielding 28.6% average MRR gains.
-
Mask-Free Privacy Extraction and Rewriting: A Domain-Aware Approach via Prototype Learning
DAMPER learns domain privacy prototypes via contrastive learning and uses them to guide mask-free privacy extraction, preference-aligned rewriting, and differential privacy sampling for LLMs.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Follow My Eyes: Backdoor Attacks on VLM-based Scanpath Prediction
Backdoor attacks on VLM-based scanpath predictors can redirect fixations toward chosen objects or inflate durations using input-conditioned triggers that evade cluster detection, and no tested defense blocks them with...
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
iTAG generates natural text paired with accurate causal graph annotations by framing concept assignment as an inverse problem and refining selections via chain-of-thought reasoning until the text's relations align wit...
-
Graph Topology Information Enhanced Heterogeneous Graph Representation Learning
ToGRL learns high-quality graph structures from raw heterogeneous graphs via a two-stage topology extraction process and prompt tuning, outperforming prior methods on five datasets.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...
-
SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models
SeaAlert generates synthetic noisy maritime distress transcripts via LLM and ASR simulation to train robust extraction of critical information from real VHF communications.
-
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Tulu 3 provides open SOTA post-trained LLMs with a novel RLVR algorithm and complete reproducibility artifacts that surpass Llama 3.1 instruct, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5-Haiku on benchmarks.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Reference graph
Works this paper leans on
-
[1]
Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
work page 2007
-
[2]
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785
work page Pith review arXiv 2019
-
[3]
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment
work page 2006
-
[4]
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge
work page 2009
-
[5]
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2015
- [6]
-
[7]
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment
work page 2006
-
[8]
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS)
work page 2015
-
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL)
work page 2019
-
[10]
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing
work page 2005
- [11]
-
[12]
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing
work page 2007
-
[13]
Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus
work page 2019
-
[14]
Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science
work page 2017
-
[15]
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415
work page Pith review arXiv 2016
-
[16]
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear
work page 2017
- [17]
-
[18]
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2016. First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
work page 2016
-
[19]
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529
-
[20]
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR)
work page 2015
- [21]
-
[22]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683
work page Pith review arXiv 2017
- [23]
-
[24]
Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning
work page 2011
- [25]
- [26]
-
[27]
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (NIPS), pages 6297--6308
work page 2017
-
[28]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations
work page 2018
-
[29]
Sebastian Nagel. 2016. Cc-news. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available
work page 2016
-
[30]
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations
work page 2019
-
[31]
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT)
work page 2018
-
[32]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop
work page 2017
-
[33]
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL)
work page 2018
-
[34]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI
work page 2018
-
[35]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI
work page 2019
-
[36]
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL)
work page 2018
-
[37]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2016
-
[38]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pages 1715--1725
work page 2016
-
[39]
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2013
-
[40]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML)
work page 2019
- [41]
- [42]
-
[43]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems
work page 2017
- [44]
-
[45]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR)
work page 2019
- [46]
-
[47]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL)
work page 2018
- [48]
- [49]
- [50]
- [51]