Recognition: 1 theorem link
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Pith reviewed 2026-05-09 01:10 UTC · model claude-opus-4-7
The pith
A careful retraining of BERT — longer, on more data, with dynamic masking and no next-sentence loss — matches or beats every model published after it on GLUE, SQuAD, and RACE.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that BERT, as originally released, was undertrained, and that a careful replication holding architecture and objective fixed — while training longer, on roughly ten times more text, with larger batches, dynamic masking, no next-sentence-prediction loss, and a byte-level BPE vocabulary — matches or surpasses every post-BERT model published up to that point on GLUE, SQuAD, and RACE. The implication the authors press is that gains attributed to newer pretraining objectives or architectures may instead be explained by training budget and data scale.
What carries the argument
A controlled ablation over BERT's training recipe rather than its architecture: (1) dynamic masking instead of a fixed precomputed mask, (2) packing full sentences across document boundaries and dropping the next-sentence-prediction auxiliary loss, (3) batch sizes of 8K sequences with retuned learning rate and Adam β₂=0.98, (4) a 50K byte-level BPE vocabulary with no language-specific preprocessing, and (5) scaling pretraining data to 160GB (BookCorpus+Wikipedia plus CC-News, OpenWebText, and Stories) and pretraining for up to 500K steps. The architecture and the masked-language-modeling objective are held fixed at BERT_LARGE.
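To make the first item concrete, a minimal sketch of dynamic masking versus BERT's precomputed masks, in Python with made-up token ids and hypothetical names (MASK_ID, VOCAB_SIZE, apply_mlm_mask); this is illustrative, not the paper's fairseq implementation.

```python
# Minimal sketch of dynamic vs. static masking, assuming toy integer token ids.
# MASK_ID, VOCAB_SIZE and MASK_PROB are illustrative constants, not the paper's code.
import random

MASK_ID = 4          # hypothetical [MASK] token id
VOCAB_SIZE = 50_000  # hypothetical vocabulary size
MASK_PROB = 0.15     # fraction of positions selected for prediction, as in BERT

def apply_mlm_mask(token_ids, rng):
    """Return (inputs, labels) with BERT's 80/10/10 corruption of selected positions."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = excluded from the loss in this sketch
    for i, tok in enumerate(token_ids):
        if rng.random() < MASK_PROB:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

# Static masking (original BERT): masks are precomputed once and reused across epochs.
# Dynamic masking (RoBERTa): resample every time a sequence is served.
def dynamic_batches(sequences, epochs, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        for seq in sequences:
            yield apply_mlm_mask(seq, rng)  # fresh mask on every pass
```

In the original setup each training sequence was duplicated with ten precomputed masks over roughly forty epochs, so any given mask was reused about four times; resampling on every pass removes that reuse, which matters more as training length grows.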
If this is right
- Reported gains from newer pretraining objectives over BERT should be re-examined against compute-matched baselines, since training budget alone closes most of the gap.
Where Pith is reading between the lines
- If most apparent progress over BERT is explained by training budget, then benchmark leaderboards in this period are partly tracking compute spend rather than modeling ideas — a methodological caution that extends well beyond NLP.
Load-bearing premise
That fixing the architecture and objective while changing data, steps, batch size, and tokenizer constitutes a fair attribution of credit — the comparison with competing methods does not retune those methods under matched compute, so the claim that masked language modeling is "competitive" with newer objectives rests on the assumption that the competitors would not pull ahead again under the same scaling treatment.
What would settle it
Retrain a competing model (e.g. XLNet or a permutation/span-based variant) under matched data (160GB), matched batch size (8K), and matched step count (500K) using the same byte-level BPE and dynamic masking, and compare GLUE/SQuAD/RACE numbers head-to-head. If the competitor still beats RoBERTa by a clear margin under matched compute, the claim that masked language modeling is competitive with the alternatives fails.
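A minimal sketch of what that settling experiment would pin down, assuming hypothetical names (PretrainBudget, run) and using the configuration numbers stated above; this is a specification of the budget to be matched, not real training code.

```python
# Hedged sketch of a matched-budget head-to-head; the numbers mirror RoBERTa's
# largest configuration, and run() is a stand-in, not an actual trainer.
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainBudget:
    corpus_gb: int = 160        # BOOKS+WIKI + CC-NEWS + OPENWEBTEXT + STORIES
    batch_size: int = 8_192     # sequences per optimizer step
    steps: int = 500_000
    seq_len: int = 512
    vocab: str = "50K byte-level BPE"
    masking: str = "dynamic"

    @property
    def sequence_passes(self) -> int:
        # shared budget in sequence passes; a token-level accounting would also
        # need to reconcile the permutation objective's per-step token count
        return self.batch_size * self.steps

def run(objective: str, budget: PretrainBudget) -> dict:
    # stand-in for pretraining + finetuning + GLUE/SQuAD/RACE evaluation
    return {"objective": objective, "sequence_passes": budget.sequence_passes}

def matched_comparison() -> dict:
    budget = PretrainBudget()
    return {obj: run(obj, budget) for obj in ("masked_lm", "permutation_lm")}
```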
read the original abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a replication and ablation study of BERT pretraining. The authors reimplement BERT in fairseq, sweep four design axes — dynamic vs. static masking (§4.1), input format and the NSP loss (§4.2), batch size (§4.3), and byte-level BPE (§4.4) — and combine the favorable settings with substantially more data (160GB across BOOKS+WIKI, CC-NEWS, OPENWEBTEXT, STORIES) and more optimizer steps (up to 500K at 8K batch). The resulting model, RoBERTa, is reported to match or exceed all post-BERT published systems on GLUE (Table 5), SQuAD v1.1/v2.0 (Table 6), and RACE (Table 7), without multi-task finetuning on GLUE or external QA data on SQuAD. The central scientific claim is that BERT was significantly undertrained and that, with the right training recipe, the MLM objective is competitive with subsequently proposed alternatives such as permutation LM (XLNet).
Significance. If the result holds, the paper materially reshapes how the community attributes credit for the gains reported in 2018–2019: a sizable fraction of post-BERT improvement is attributable to data scale, batch size, and training length rather than to new objectives or architectures. This is a useful corrective and a high-value contribution to a literature where ablations against private data and undisclosed compute budgets have made comparisons unreliable. Concrete strengths are: (i) the ablations in Tables 1–3 are clean and use medians over five seeds; (ii) the introduction of CC-NEWS partially closes the public-data gap with concurrent work; (iii) models, code, and a documented hyperparameter recipe (Tables 9–10) are released, enabling third-party replication. The released artifact has in fact become a widely used baseline, which is itself evidence of the practical claim. The paper is appropriately modest in footnote 2 about the limits of its comparisons.
major comments (4)
- [§5, Table 4] The headline claim that MLM is 'competitive with' permutation LM is not cleanly supported by the most controlled row of Table 4. At matched BOOKS+WIKI data, XLNet_LARGE reports 94.0/87.8 on SQuAD 1.1/2.0 and 88.4 on MNLI-m, while RoBERTa-BOOKS+WIKI reports 93.6/87.3 and 89.0 — RoBERTa loses on SQuAD 2.0 and is within noise on MNLI. RoBERTa only clearly surpasses XLNet after adding ~10× more text and 5× more updates (500K), at which point XLNet itself is also no longer at its matched-data setting. Please either (a) restate the conclusion as 'MLM is competitive once given comparable or larger training budget,' or (b) report a compute- and data-matched comparison (same corpus, same token count seen, same batch and step budget). The current phrasing in §1 and §7 overstates what Table 4 shows.
- [§4.4 / Table 4] The switch to a 50K byte-level BPE adds approximately 20M parameters to BERT_LARGE (the paper's own estimate in §4.4). This confounds the BERT_LARGE → RoBERTa-BOOKS+WIKI comparison in Table 4 (90.9/81.8 → 93.6/87.3 on SQuAD), since part of the gap may reflect added embedding capacity rather than 'BERT was undertrained.' §4.4 states 'early experiments revealed only slight differences' but provides no table. A small ablation isolating 30K char-BPE vs 50K byte-BPE at otherwise matched settings (one row would suffice) would close this gap and is important because the BPE choice is one of the four pillars of the recipe. (The parameter delta is worked out in the sketch after this list.)
- [§4.3, Table 3] The large-batch comparison varies batch size, step count, and learning rate jointly while reporting only perplexity and two GLUE dev metrics. The 2K-batch/125K-step setting outperforms both 256/1M and 8K/31K on perplexity (3.68 vs 3.99 vs 3.77), yet the paper adopts 8K for downstream experiments citing parallelization. Please clarify why 2K is not the preferred choice on the evidence presented, or report SQuAD/RACE numbers for the three settings so the choice is grounded in end-task performance rather than engineering convenience. (The three settings' sequence budgets are worked out in the sketch after this list.)
- [§4.2, Table 2] The conclusion that removing NSP 'matches or slightly improves' downstream task performance is drawn from differences that are often within plausible seed variance (e.g., FULL-SENTENCES 84.7 vs SEGMENT-PAIR+NSP 84.0 on MNLI-m; 92.5 vs 92.9 on SST-2). Reported numbers are medians over five seeds but no spread is given. Please report standard deviations or min/max across seeds for Table 2 so the reader can judge whether the NSP-removal effect exceeds noise; this matters because removing NSP is one of the four headline modifications. (The per-seed spread computation being requested is sketched after this list.)
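The arithmetic behind major comments #2-#4 is small enough to check directly. A minimal sketch, assuming the standard published vocabulary sizes and hidden width (BERT: 30,522 WordPiece types; RoBERTa: 50,265 byte-level BPE types; hidden size 1024 at LARGE scale) and using placeholder seed scores for the spread computation:

```python
# Back-of-envelope checks for major comments #2-#4. Vocabulary sizes and hidden
# width are the standard published values; the seed scores are placeholders.
import statistics

# (#2) extra embedding parameters from the larger vocabulary; with tied
# input/output embeddings the table is counted once
hidden = 1024
extra_params = (50_265 - 30_522) * hidden
print(f"extra embedding parameters: ~{extra_params / 1e6:.1f}M")  # ~20.2M

# (#3) total sequence passes for the three Table 3 settings are roughly matched,
# so the comparison varies batch size at a near-constant data budget
for bsz, steps in [(256, 1_000_000), (2_048, 125_000), (8_192, 31_000)]:
    print(f"batch {bsz:>5} x {steps:>9,} steps = {bsz * steps / 1e6:5.0f}M sequence passes")

# (#4) the spread being requested for Table 2: per-cell std and min/max over the
# five finetuning seeds, reported alongside the median
seed_scores = [84.5, 84.9, 84.7, 84.4, 85.0]  # illustrative placeholders, not from the paper
print(f"median {statistics.median(seed_scores):.1f}  "
      f"std {statistics.stdev(seed_scores):.2f}  "
      f"min/max {min(seed_scores)}/{max(seed_scores)}")
```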
minor comments (6)
- [§3.2] CC-NEWS filtering is described in one sentence ('76GB after filtering'). A short description of the filter (language ID, dedup, boilerplate removal) would help replication, especially since the dataset is presented as a contribution.
- [Table 4] The 'data' column lists 13GB for XLNet and 16GB for RoBERTa under 'BOOKS+WIKI'; footnote 3 attributes this to Wikipedia cleaning differences. Worth restating in the Table 4 caption so a casual reader does not mistake this for a data-budget mismatch in RoBERTa's favor at the matched row.
- [§5.1, WNLI] The WNLI procedure (margin ranking with spaCy-extracted candidates, SuperGLUE reformatting) is non-standard and excludes negative training examples. Given the 91.3 dev / 89.0 test number contributes to the average, a sentence explicitly flagging that this score is not directly comparable to other systems' WNLI numbers would be appropriate.
- [§4.1, Table 1] The dynamic-vs-static gap is small (e.g., 78.7 vs 78.3 SQuAD 2.0; 84.0 vs 84.3 MNLI). Calling dynamic masking 'comparable or slightly better' is fair, but the abstract and §1 list dynamic masking as one of four key improvements; consider softening the framing to match Table 1.
- [Typography] Several places contain OCR-like artifacts in the submitted PDF ('Y ang', 'Y ou', 'V aswani', 'B OOK CORPUS'); please verify font/encoding in the camera-ready.
- [§5] The Appendix hyperparameters (Tables 9–10) are useful; consider also reporting the total wall-clock time and GPU-hours per pretraining run so future replications can budget appropriately. The text mentions '1024 V100 GPUs for approximately one day' but only for one configuration. (A rough conversion to GPU-hours is sketched after this list.)
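The one figure the paper does give supports a rough budget estimate. A minimal sketch, assuming the quoted one-day figure refers to the 100K-step configuration and that cost scales roughly linearly with step count (both assumptions, not statements from the paper):

```python
# Rough GPU-hour arithmetic behind the last minor comment; the extrapolation to
# 500K steps is a linear-scaling assumption for illustration only.
gpus, hours = 1024, 24
gpu_hours_100k = gpus * hours
print(f"~{gpu_hours_100k:,} GPU-hours for the 100K-step run")            # ~24,576
print(f"~{gpu_hours_100k * 5:,} GPU-hours if 500K steps scaled linearly")
```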
Simulated Author's Rebuttal
We thank the referee for the careful and substantive report, and in particular for distinguishing the empirical contribution (the recipe and the released artifact) from the rhetorical framing of the central claim. We accept the four major points essentially as stated. The referee is correct that (i) our 'competitive with permutation LM' claim is properly conditioned on training budget rather than asserted at matched data, (ii) the 50K byte-level BPE adds parameters and partially confounds the BERT_LARGE -> RoBERTa comparison in Table 4, (iii) the choice of 8K batch over 2K in Table 3 is motivated by parallelization rather than by a clean end-task win, and (iv) several of the NSP-removal contrasts in Table 2 are within seed noise and the prose should reflect this. We will revise §1, §4.2, §4.3, §4.4, §5 and §7 accordingly, add per-seed spread to Table 2, add the BASE-scale BPE comparison and SQuAD numbers for the Table 3 settings to the appendix, and explicitly bound the headline claim. Two items — a strictly token-matched re-run against XLNet, and a LARGE-scale char-BPE vs byte-BPE ablation — we cannot produce within the revision window; we list these as standing objections and will disclose them rather than overclaim.
read point-by-point responses
-
Referee: Major #1 [§5, Table 4]: Headline claim that MLM is 'competitive with' permutation LM is not cleanly supported at matched BOOKS+WIKI. RoBERTa loses on SQuAD 2.0 (87.3 vs 87.8) and is within noise on MNLI; the win only emerges after ~10x data and 5x steps, by which point XLNet is also off its matched-data setting. Restate to 'competitive once given comparable or larger budget,' or run a strictly compute- and data-matched comparison.
Authors: The referee is right that our matched-data row is the appropriate basis for the strongest version of the claim, and that on that row RoBERTa is essentially tied with (and slightly behind on SQuAD 2.0) XLNet_LARGE rather than dominating it. Our intended claim was the weaker one the referee articulates: that MLM remains competitive with permutation LM when given a comparable or larger training budget, and that a substantial portion of the post-BERT gains attributed to new objectives can be recovered by training scale alone. We will revise §1 and §7 to state this more precisely, replacing 'match or exceed every model published after it' in unqualified form with language that explicitly conditions on training budget. We will also add a sentence to §5 noting that at matched BOOKS+WIKI / 1M-equivalent budget, RoBERTa and XLNet_LARGE are within ~0.5 points on SQuAD/MNLI, and that the larger-budget rows of Table 4 are not budget-matched against XLNet's own larger-budget row (94.5/88.8, 89.8 with 126GB / 500K / batch 2K). A strictly token-matched re-run against XLNet is unfortunately outside what we can produce within the revision window — XLNet's permutation training has a different effective tokens-per-step accounting, and we do not have access to their exact data composition — but we will state this limitation explicitly rather than paper over it. revision: yes
-
Referee: Major #2 [§4.4 / Table 4]: The 50K byte-level BPE adds ~20M parameters to BERT_LARGE, confounding the BERT_LARGE -> RoBERTa-BOOKS+WIKI comparison. §4.4 asserts 'only slight differences' but shows no table. Provide a one-row ablation isolating 30K char-BPE vs 50K byte-BPE at matched settings.
Authors: We agree this is a real confound and that §4.4's qualitative remark is not a substitute for a number. Our internal early experiments compared the two encodings at BERT_BASE scale with otherwise matched settings and did not show systematic gains for byte-level BPE (in fact slightly worse on some tasks, as noted), which is why we framed the choice as motivated by universality rather than accuracy. We will add a row to the appendix giving the head-to-head dev numbers we have at BASE scale, and we will explicitly flag in §4.4 and in the discussion of Table 4 that the ~20M-parameter increase at LARGE is a confound for the BERT_LARGE -> RoBERTa-BOOKS+WIKI delta, so that readers do not attribute the full 90.9 -> 93.6 SQuAD 1.1 gap to 'undertraining.' We do not have a fully matched 30K-char vs 50K-byte run at LARGE scale, and we will say so rather than overclaim. revision: partial
-
Referee: Major #3 [§4.3, Table 3]: Batch size, steps, and learning rate vary jointly; only ppl + two GLUE metrics are reported. 2K/125K beats 8K/31K on ppl but 8K is adopted citing parallelization. Justify on end-task performance or report SQuAD/RACE for the three settings.
Authors: The referee has correctly identified that Table 3 does not on its face justify 8K over 2K on accuracy grounds. The honest statement of our reasoning is engineering: at the scale of the §5 experiments (1024 V100s, 500K steps, 160GB), 8K batches were materially easier to keep utilization high under distributed data-parallel training, and the dev-set differences we observed between 2K and 8K at this controlled BASE-scale setup were small and did not consistently favor 2K on downstream tasks beyond what Table 3 shows. We will (i) explicitly state in §4.3 that the choice of 8K over 2K is driven by parallelization rather than by an accuracy advantage on Table 3, (ii) add SQuAD numbers for the three Table 3 settings to the appendix where we have them, and (iii) soften the implication that 8K is optimal on the evidence presented. We agree this is a fair correction. revision: yes
-
Referee: Major #4 [§4.2, Table 2]: NSP-removal effects are within plausible seed variance (e.g., FULL-SENTENCES 84.7 vs SEGMENT-PAIR+NSP 84.0 on MNLI-m; 92.5 vs 92.9 on SST-2). Report std / min-max across the five seeds so readers can judge whether the effect exceeds noise.
Authors: This is well taken. Our claim in §4.2 is deliberately phrased as 'matches or slightly improves' rather than 'improves,' precisely because for several of the cells the gap is within what we observe across seeds, and the stronger statement we make is the negative one — that retaining NSP does not help and that SENTENCE-PAIR (which forces short inputs) clearly hurts. We will add per-cell spread (std and min/max over the five seeds) to Table 2 in the revision, for both the NSP and the input-format rows, so the reader can see directly which contrasts are above seed noise (SEGMENT-PAIR vs SENTENCE-PAIR; SEGMENT-PAIR vs DOC-SENTENCES on SQuAD/RACE) and which are not (FULL-SENTENCES vs SEGMENT-PAIR+NSP on MNLI/SST-2). We will also adjust the prose in §4.2 and §7 so that the headline summary about NSP is 'removing NSP does not hurt, and removing it together with the SENTENCE-PAIR format helps,' rather than implying a uniform improvement. revision: yes
- A strictly token-, batch-, and step-matched head-to-head against XLNet (Major #1) is not feasible within the revision window: we lack access to XLNet's exact data composition and the permutation objective's per-step token accounting differs from MLM's. We will instead bound the claim and disclose the limitation, rather than produce a comparison we cannot run cleanly.
- We do not have a fully matched 30K char-BPE vs 50K byte-BPE ablation at LARGE scale (Major #2). We can add the BASE-scale comparison we did run, and we will flag the parameter-count confound at LARGE explicitly, but a LARGE-scale matched ablation is beyond the compute we can commit to this revision.
Circularity Check
No meaningful circularity: RoBERTa's claims are evaluated on external held-out benchmarks (GLUE leaderboard, SQuAD, RACE), not on quantities fitted by the authors.
full rationale
This is an empirical replication/ablation study of BERT pretraining. The central claims — that dynamic masking, removing NSP, larger batches, byte-level BPE, more data, and longer training each improve downstream performance, and that the resulting model matches/exceeds XLNet — are evaluated against external benchmarks (GLUE test via a third-party leaderboard, SQuAD 1.1/2.0, RACE) using metrics (F1, EM, accuracy) defined outside the paper. Hyperparameters are tuned on dev sets and reported on test sets; there is no instance of a fitted parameter being renamed as a prediction, no self-definitional loop, and no load-bearing self-citation that would constitute circularity in the technical sense. Comparisons to BERT and XLNet quote numbers from those papers (Devlin et al. 2019; Yang et al. 2019), which are independent prior work, not self-citations of the present authors used to forbid alternatives. The reader's and skeptic's critiques are real concerns but are about *fairness of attribution under non-matched compute/data budgets*, not about circularity. Specifically: (i) at matched BOOKS+WIKI, XLNet edges RoBERTa on SQuAD 2.0 (87.8 vs 87.3), so the "MLM ≈ permutation LM" claim leans on the extra-data/extra-steps rows; (ii) the byte-level BPE adds ~20M params with no isolated ablation ("early experiments revealed only slight differences" — Section 4.4 — but no table). These are confounds in causal attribution and belong under correctness/scope risk, not circularity. The derivation chain itself does not collapse to its inputs by construction. A score of 1 reflects only routine self-citation (Ott et al. 2018, 2019 for fairseq and large-batch NMT) which is methodological tooling, not load-bearing for the empirical claims.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
-
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
-
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
LLMs outperform single human raters at spotting relative weaknesses in L2 writing profiles on the ICNALE GRA dataset while humans are better at spotting strengths, using a self-referential intra-learner evaluation method.
-
Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...
-
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and li...
-
Not all ANIMALs are equal: metaphorical framing through source domains and semantic frames
An NLP framework shows that liberals and conservatives use different semantic frames within the same metaphorical source domains when discussing immigration, while also uncovering nuanced frames in climate change coverage.
-
A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Low information density is identified as the root cause of NER failures on user-generated content, with the Window-Aware Optimization Module delivering up to 4.5% F1 gains and new SOTA on WNUT2017.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
SecureRouter: Encrypted Routing for Efficient Secure Inference
SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on ...
-
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
METRO induces both short-term actions and long-term planning from expert transcripts into a Strategy Forest, outperforming prior methods by 9-10% on two non-collaborative dialogue benchmarks.
-
SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
-
BMdataset: A Musicologically Curated LilyPond Dataset
A musicologically curated LilyPond dataset of 393 Baroque scores enables LilyBERT to outperform large-scale pre-training on composer and style classification when used alone for fine-tuning.
-
LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.
-
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
TransFIR enables reasoning on temporal knowledge graphs for emerging entities by clustering them into semantic groups and borrowing interaction histories from similar known entities, yielding 28.6% average MRR gains.
-
Mask-Free Privacy Extraction and Rewriting: A Domain-Aware Approach via Prototype Learning
DAMPER learns domain privacy prototypes via contrastive learning and uses them to guide mask-free privacy extraction, preference-aligned rewriting, and differential privacy sampling for LLMs.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Follow My Eyes: Backdoor Attacks on VLM-based Scanpath Prediction
Backdoor attacks on VLM-based scanpath predictors can redirect fixations toward chosen objects or inflate durations using input-conditioned triggers that evade cluster detection, and no tested defense blocks them with...
-
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
-
iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
iTAG generates natural text paired with accurate causal graph annotations by framing concept assignment as an inverse problem and refining selections via chain-of-thought reasoning until the text's relations align wit...
-
Graph Topology Information Enhanced Heterogeneous Graph Representation Learning
ToGRL learns high-quality graph structures from raw heterogeneous graphs via a two-stage topology extraction process and prompt tuning, outperforming prior methods on five datasets.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...
-
SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models
SeaAlert generates synthetic noisy maritime distress transcripts via LLM and ASR simulation to train robust extraction of critical information from real VHF communications.
-
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Tulu 3 provides open SOTA post-trained LLMs with a novel RLVR algorithm and complete reproducibility artifacts that surpass Llama 3.1 instruct, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5-Haiku on benchmarks.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Reference graph
Works this paper leans on
-
[1]
Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
work page 2007
-
[2]
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785
work page Pith review arXiv 2019
-
[3]
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment
work page 2006
-
[4]
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge
work page 2009
-
[5]
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2015
- [6]
-
[7]
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment
work page 2006
-
[8]
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS)
work page 2015
-
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL)
work page 2019
-
[10]
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing
work page 2005
- [11]
-
[12]
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing
work page 2007
-
[13]
Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus
work page 2019
-
[14]
Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science
work page 2017
-
[15]
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415
work page Pith review arXiv 2016
-
[16]
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear
work page 2017
- [17]
-
[18]
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2016. First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
work page 2016
-
[19]
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529
-
[20]
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR)
work page 2015
- [21]
-
[22]
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683
work page Pith review arXiv 2017
- [23]
-
[24]
Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning
work page 2011
- [25]
- [26]
-
[27]
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (NIPS), pages 6297--6308
work page 2017
-
[28]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations
work page 2018
-
[29]
Sebastian Nagel. 2016. Cc-news. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available
work page 2016
-
[30]
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations
work page 2019
-
[31]
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT)
work page 2018
-
[32]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop
work page 2017
-
[33]
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL)
work page 2018
-
[34]
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI
work page 2018
-
[35]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI
work page 2019
-
[36]
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL)
work page 2018
-
[37]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2016
-
[38]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pages 1715--1725
work page 2016
-
[39]
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP)
work page 2013
-
[40]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML)
work page 2019
- [41]
- [42]
-
[43]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems
work page 2017
- [44]
-
[45]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR)
work page 2019
- [46]
-
[47]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL)
work page 2018
- [48]
- [49]
- [50]
- [51]