OPT: Open Pre-trained Transformer Language Models

arxiv: 2205.01068 · v4 · submitted 2022-05-02 · 💻 cs.CL · cs.LG

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

Pith reviewed 2026-05-10 20:48 UTC · model grok-4.3

keywords large language models · open source · pre-trained transformers · GPT-3 · carbon footprint · decoder-only models · few-shot learning · model release

The pith

A suite of open decoder-only transformer models of up to 175B parameters performs comparably to GPT-3 while requiring only one-seventh of the carbon footprint to develop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OPT, a collection of pre-trained language models ranging in size from 125 million to 175 billion parameters. The models are released with full weights and training code to allow broad research access. The central demonstration is that the largest version performs comparably to the closed GPT-3 model while incurring roughly one-seventh of the carbon footprint during development. This matters because it lowers the barrier to studying and improving large language models, work that has so far been confined to a small number of organizations with massive resources.

Core claim

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We

What carries the argument

The OPT suite, a collection of openly released decoder-only pre-trained transformer language models ranging from 125M to 175B parameters that includes full weights, training code, and infrastructure logs.
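
To give a feel for what the two ends of that size range mean architecturally, the sketch below applies the common rule of thumb of about 12·d² weights per decoder layer. The layer counts, hidden sizes, vocabulary size, and context length are assumptions brought in for illustration (they do not appear in the text reviewed here), and `approx_decoder_params` is a hypothetical helper, not part of any released code.

```python
"""Rough parameter count for a decoder-only transformer (illustrative only)."""


def approx_decoder_params(n_layers: int, d_model: int,
                          vocab_size: int, max_positions: int) -> int:
    """Estimate weights, ignoring biases and layer-norm parameters."""
    # Per layer: 4*d^2 for the attention projections (Q, K, V, output)
    # plus 8*d^2 for a feed-forward block with hidden width 4*d.
    per_layer = 12 * d_model * d_model
    # Token embeddings (assumed tied with the output layer) plus
    # learned positional embeddings.
    embeddings = (vocab_size + max_positions) * d_model
    return n_layers * per_layer + embeddings


# ASSUMED configurations for the smallest and largest models in the suite.
# These values are not stated in the reviewed text.
ASSUMED_CONFIGS = {
    "smallest": dict(n_layers=12, d_model=768,
                     vocab_size=50_272, max_positions=2_048),
    "largest": dict(n_layers=96, d_model=12_288,
                    vocab_size=50_272, max_positions=2_048),
}

if __name__ == "__main__":
    for name, cfg in ASSUMED_CONFIGS.items():
        total = approx_decoder_params(**cfg)
        print(f"{name}: ~{total / 1e9:.3f}B parameters")
```

Under these assumptions the estimates land close to the nominal 125M and 175B labels, which is all the rule of thumb is meant to show; exact counts depend on details it ignores.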

If this is right

Researchers can directly access and modify the full model weights instead of relying on restricted APIs.
Large-scale language model development becomes feasible with substantially lower carbon emissions.
The released code allows experimentation across the full range of model sizes from 125M to 175B parameters.
Infrastructure logs provide concrete details on challenges encountered during training of these models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread access to the weights could enable more groups to test safety and bias mitigation techniques on models of this scale.
Lower training costs may support repeated fine-tuning cycles that were previously impractical for non-industry labs.
The open release creates a direct path for third parties to verify the reported performance and emissions numbers.

Load-bearing premise

That the benchmarks and evaluation protocols used to establish comparability between OPT-175B and GPT-3 are fair, comprehensive, and not affected by differences in training data or optimization details.

What would settle it

An independent run of OPT-175B on the same zero- and few-shot benchmarks as GPT-3 that shows a clear performance gap, or a recalculation of training emissions that exceeds one-seventh of the GPT-3 figure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the OPT suite of decoder-only pre-trained transformer language models with sizes ranging from 125M to 175B parameters. The central claims are that OPT-175B achieves performance comparable to GPT-3 across zero- and few-shot tasks while requiring only 1/7th the carbon footprint to develop, and that the models, training logbook, and code will be released to enable broader research.

Significance. If the performance and carbon claims hold under transparent evaluation, the work is significant for lowering barriers to studying large language models by providing open weights and infrastructure details. The release of code and logs supports reproducibility, and the carbon reduction highlights practical efficiencies in training at scale.

major comments (2)

Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.
Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted under "Load-bearing premise" above.
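
The first major comment asks for the assumptions behind the 1/7th ratio to be laid out side by side. A minimal sketch of how such assumptions might combine into an emissions estimate, assuming a simple power-times-time model: the `TrainingRun` class, its field names, and every number below are hypothetical placeholders, not measurements from the paper or from any GPT-3 estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Inputs to a back-of-the-envelope training emissions estimate."""
    n_accelerators: int          # number of GPUs or similar devices
    hours: float                 # wall-clock training time
    tdp_watts: float             # thermal design power per device
    utilization: float           # mean fraction of TDP actually drawn
    pue: float                   # data-center power usage effectiveness
    grid_kg_co2e_per_kwh: float  # carbon intensity of the electricity

    def energy_kwh(self) -> float:
        device_watts = self.n_accelerators * self.tdp_watts * self.utilization
        return device_watts * self.hours * self.pue / 1000.0

    def emissions_tonnes(self) -> float:
        return self.energy_kwh() * self.grid_kg_co2e_per_kwh / 1000.0


def footprint_ratio(a: TrainingRun, b: TrainingRun) -> float:
    """Emissions of run `a` expressed as a fraction of run `b`."""
    return a.emissions_tonnes() / b.emissions_tonnes()


# PLACEHOLDER inputs chosen only to exercise the arithmetic.
run_a = TrainingRun(n_accelerators=1000, hours=800, tdp_watts=400,
                    utilization=0.5, pue=1.1, grid_kg_co2e_per_kwh=0.25)
run_b = TrainingRun(n_accelerators=1000, hours=5600, tdp_watts=400,
                    utilization=0.5, pue=1.1, grid_kg_co2e_per_kwh=0.25)

if __name__ == "__main__":
    print(f"run A: {run_a.emissions_tonnes():.1f} t CO2e")
    print(f"run B: {run_b.emissions_tonnes():.1f} t CO2e")
    print(f"A / B: {footprint_ratio(run_a, run_b):.3f}")
```

Because the estimate is a plain product, the ratio moves in direct proportion to any single input. That is why the referee wants each input listed for both models rather than a single scalar.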

minor comments (2)

The logbook release is a strength for transparency; however, it would benefit from an index or summary table mapping challenges to specific training stages or model sizes.
Notation for model sizes (e.g., OPT-175B) is clear, but ensure all hyperparameter tables in the appendix explicitly list learning rate schedules, batch sizes, and data mixtures for each scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the planned revisions.

Referee: Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.

Authors: We agree that a transparent comparison of assumptions is necessary to support the carbon claim. In the revised manuscript we will add a side-by-side table in the carbon footprint section that explicitly lists TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw for both OPT-175B (our measurements) and the GPT-3 estimate. This will allow readers to inspect the basis of the 1/7th ratio directly.
revision: yes

Referee: Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted under "Load-bearing premise" above.

Authors: We acknowledge that a consolidated table improves clarity. Due to the prohibitive cost of training at this scale we performed only a single run for OPT-175B and therefore cannot supply run-to-run variance or error bars. We will revise the evaluation section to present all zero- and few-shot results in one consolidated table with exact scores for every task, and we will add explicit text noting the single-run limitation and its implications for statistical comparison.
revision: partial
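
A single training run rules out run-to-run variance, but the uncertainty that comes from a finite evaluation set can still be estimated. The sketch below shows one way to do that with a percentile bootstrap over per-example correctness. This is not something the paper or the rebuttal describes; `bootstrap_accuracy_ci` is a hypothetical helper and the toy data is made up.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) from a percentile bootstrap.

    `correct` holds one 0/1 entry per evaluation example.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    accuracy = sum(correct) / n
    # Resample the evaluation set with replacement and recompute accuracy.
    resampled = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return accuracy, resampled[lo_idx], resampled[hi_idx]


if __name__ == "__main__":
    # Toy data: 70 correct answers out of 100 examples.
    toy = [1] * 70 + [0] * 30
    acc, lower, upper = bootstrap_accuracy_ci(toy)
    print(f"accuracy={acc:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

An interval of this kind reflects only which examples happened to be in the benchmark. It says nothing about how scores would shift under a different random seed, data order, or prompt format, so it complements rather than replaces the single-run caveat.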

Circularity Check

0 steps flagged

No circularity: empirical training results and external benchmark comparisons

The paper presents direct empirical results from training decoder-only transformers (125M to 175B parameters) and evaluates them on standard zero- and few-shot benchmarks against GPT-3. The carbon-footprint comparison (1/7th) relies on an external third-party estimate for GPT-3 rather than any self-derived quantity or fitted parameter. No equations, ansatzes, uniqueness theorems, or self-citations reduce claims to inputs by construction; the derivation chain consists of reported training runs and external references, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This review is based solely on the abstract; full methods, hyperparameters, data, and evaluation details are unavailable. No free parameters, axioms, or invented entities can be audited from the provided text.

pith-pipeline@v0.9.0 · 5503 in / 1023 out tokens · 37177 ms · 2026-05-10T20:48:21.628279+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ORPO: Monolithic Preference Optimization without Reference Model
cs.CL 2024-03 conditional novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Code as Policies: Language Model Programs for Embodied Control
cs.RO 2022-09 accept novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
cs.SE 2026-05 unverdicted novelty 7.0

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
Modality-Decoupled Online Recursive Editing
cs.LG 2026-05 conditional novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
cs.LG 2026-05 conditional novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
cs.LG 2026-05 unverdicted novelty 7.0

Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
cs.LG 2026-05 conditional novelty 7.0

Rank-1 activation steering is often cheap when prompt-boundary alignment guides budgeted search and concept granularity diagnoses directional stability, with the GRACE framework reducing trials to 95% utility by 39.8%...
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
cs.LG 2026-05 unverdicted novelty 7.0

Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 accept novelty 7.0

Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 unverdicted novelty 7.0

Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
cs.LG 2026-05 unverdicted novelty 7.0

PACZero achieves zero mutual information privacy for LLM fine-tuning via sign-quantized zeroth-order gradients, delivering near-non-private accuracy on SST-2 and SQuAD at I=0.
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
cs.CL 2026-05 unverdicted novelty 7.0

MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 conditional novelty 7.0

A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 unverdicted novelty 7.0

Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
cs.PF 2026-04 unverdicted novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
cs.AR 2026-04 unverdicted novelty 7.0

A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
On the Invariants of Softmax Attention
cs.LG 2026-04 unverdicted novelty 7.0

Softmax attention has algebraic invariants including zero-sum rows and head-dimension rank limits, plus consistent variance spread in language models attributed to key incoherence.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
cs.CV 2026-02 unverdicted novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
cs.LG 2026-01 unverdicted novelty 7.0

HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
cs.CR 2025-12 unverdicted novelty 7.0

DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
cs.CL 2025-12 conditional novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...
All is Not Lost: LLM Recovery without Checkpoints
cs.DC 2025-06 conditional novelty 7.0

CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding st...
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Federated Co-tuning Framework for Large and Small Language Models
cs.CL 2024-11 unverdicted novelty 7.0

FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Topic-Based Watermarks for Large Language Models
cs.CR 2024-04 unverdicted novelty 7.0

A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and i...
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Detecting Pretraining Data from Large Language Models
cs.CL 2023-10 conditional novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers
cs.CL 2023-09 unverdicted novelty 7.0

EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.
Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG 2023-09 conditional novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
Steering Language Models With Activation Engineering
cs.CL 2023-08 unverdicted novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
The Curse of Recursion: Training on Generated Data Makes Models Forget
cs.LG 2023-05 conditional novelty 7.0

Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
cs.AI 2023-04 accept novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
cs.LG 2022-10 unverdicted novelty 7.0

GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting
cs.CR 2026-05 unverdicted novelty 6.0

TimeGuard employs channel-wise pool training initialized with time-aware criteria and distance-regularized loss selection to defend time series forecasting against backdoor attacks, improving robustness by 1.96x while...
Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies
cs.CL 2026-05 unverdicted novelty 6.0

Self-training restructures language by amplifying surface markers and collapsing deep syntax according to structural depth rather than frequency, as evidenced by correlations across multiple models and a human fine-tu...
DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

DP-SelFT improves the privacy-utility trade-off for LLM fine-tuning by selecting robust layer subsets via DP synthetic data and perturbation-matched evaluation.
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
cs.LG 2026-05 unverdicted novelty 6.0

Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
Instructions Shape Production of Language, not Processing
cs.CL 2026-05 unverdicted novelty 6.0

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
cs.LG 2026-05 conditional novelty 6.0

ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
cs.LG 2026-05 unverdicted novelty 6.0

SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
cs.LG 2026-05 unverdicted novelty 6.0

DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.


International Conference on Learning Representations , year=

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning , author=. International Conference on Learning Representations , year=

work page
[31]

Advances in neural information processing systems , pages=

Skip-thought vectors , author=. Advances in neural information processing systems , pages=

work page
[32]

Learning Distributed Representations of Sentences from Unlabelled Data

Hill, Felix and Cho, Kyunghyun and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016. doi:10.18653/v1/N16-1162

work page doi:10.18653/v1/n16-1162 2016
[33]

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , booktitle =

Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo\". Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , booktitle =. 2017 , address =

work page 2017
[34]

To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks

To tune or not to tune? adapting pretrained representations to diverse tasks , author=. arXiv preprint arXiv:1903.05987 , year=

work page Pith review arXiv 1903
[35]

Uniﬁed language model pre- training for natural language understanding and gen- eration

Unified Language Model Pre-training for Natural Language Understanding and Generation , author=. arXiv preprint arXiv:1905.03197 , year=

work page arXiv 1905
[36]

Chan, William and Kitaev, Nikita and Guu, Kelvin and Stern, Mitchell and Uszkoreit, Jakob , journal=

work page
[37]

Learned in translation: Contextualized word vectors , author=

work page
[38]

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle=naacl, year=

work page
[39]

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet: Generalized Autoregressive Pretraining for Language Understanding , author=. arXiv preprint arXiv:1906.08237 , year=

work page internal anchor Pith review arXiv 1906
[41]

Cloze-driven Pretraining of Self-attention Networks

Cloze-driven pretraining of self-attention networks , author=. arXiv preprint arXiv:1903.07785 , year=

work page Pith review arXiv 1903
[42]

International Conference on Learning Representations , year=

Adaptive Input Representations for Neural Language Modeling , author=. International Conference on Learning Representations , year=

work page
[43]

Generating Long Sequences with Sparse Transformers

Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[44]

OpenWebText Corpus , author=

work page
[45]

A Fair Comparison Study of XLNet and BERT with Large Models , author=

work page
[46]

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Reducing BERT Pre-Training Time from 3 Days to 76 Minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page internal anchor Pith review arXiv 1904
[47]

One weird trick for parallelizing convolutional neural networks

One weird trick for parallelizing convolutional neural networks , author=. arXiv preprint arXiv:1404.5997 , year=

work page Pith review arXiv
[48]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=

work page
[49]

First Quora Dataset Release: Question Pairs , author=

work page
[50]

Sara Bergman , howpublished=

work page
[51]

2017 , booktitle =

Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela , title =. 2017 , booktitle =

work page 2017
[52]

Defending against neural fake news

Defending Against Neural Fake News , author=. arXiv preprint arXiv:1905.12616 , year=

work page arXiv 1905
[53]

, booktitle=iclr, year=

Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , booktitle=iclr, year=

work page
[54]

and Schwenk, Holger and Stoyanov, Veselin

Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin. XNLI: Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018

work page 2018
[55]

Bowman , journal=

Alex Wang and Yada Pruksachatkun and Nikita Nangia and Amanpreet Singh and Julian Michael and Felix Hill and Omer Levy and Samuel R. Bowman , journal=. Super

work page
[56]

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , booktitle=

work page
[57]

De Marneffe, Marie-Catherine and Simons, Mandy and Tonhauser, Judith , note=

work page
[58]

2011 AAAI Spring Symposium Series , year=

Choice of plausible alternatives: An evaluation of commonsense causal reasoning , author=. 2011 AAAI Spring Symposium Series , year=

work page 2011
[59]

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

Looking beyond the surface: A challenge set for reading comprehension over multiple sentences , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

work page 2018
[60]

Sheng Zhang and Xiaodong Liu and Jingjing Liu and Jianfeng Gao and Kevin Duh and Benjamin Van Durme , journal=

work page
[61]

Dagan, Ido and Glickman, Oren and Magnini, Bernardo , booktitle=. The. 2006 , publisher=

work page 2006
[62]

The second

Bar Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan , year=. The second

work page
[63]

The third

Giampiccolo, Danilo and Magnini, Bernardo and Dagan, Ido and Dolan, Bill , booktitle=. The third. 2007 , organization=

work page 2007
[64]

The Fifth

Bentivogli, Luisa and Dagan, Ido and Dang, Hoa Trang and Giampiccolo, Danilo and Magnini, Bernardo , booktitle=. The Fifth

work page
[65]

Pilehvar, Mohammad Taher and Camacho-Collados, Jose , booktitle=

work page
[66]

Proceedings of NAACL-HLT , year=

Gender Bias in Coreference Resolution , author=. Proceedings of NAACL-HLT , year=

work page
[67]

Proceedings of EMNLP , year=

Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation , author=. Proceedings of EMNLP , year=

work page
[68]

Levesque, Hector J and Davis, Ernest and Morgenstern, Leora , booktitle=. The

work page
[69]

Automatic Differentiation in

Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam , booktitle=. Automatic Differentiation in

work page
[70]

Neural Machine Translation of Rare Words with Subword Units , author=

work page
[71]

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding , author=. arXiv preprint arXiv:1904.09482 , year=

work page Pith review arXiv 1904
[72]

A surprisingly robust trick for winograd schema challenge

A Surprisingly Robust Trick for Winograd Schema Challenge , author=. arXiv preprint arXiv:1905.06290 , year=

work page arXiv 1905
[73]

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks , author=. arXiv preprint arXiv:1811.01088 , year=

work page Pith review arXiv
[74]

2017 , Note =

Honnibal, Matthew and Montani, Ines , TITLE =. 2017 , Note =

work page 2017
[75]

International Conference on Learning Representations , year=

Mixed Precision Training , author=. International Conference on Learning Representations , year=

work page
[76]

Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli , booktitle = naacl_demo, year =

work page
[78]

How to ﬁne-tune bert for text classiﬁcation?arXiv preprint arXiv:1905.05583, 2019

How to Fine-Tune BERT for Text Classification? , author=. arXiv preprint arXiv:1905.05583 , year=

work page arXiv 1905
[79]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , author =

work page
[80]

5th Workshop on Energy Efficient Machine Learning and Cognitive Computing , year=

Q8bert: Quantized 8bit bert , author=. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing , year=

work page
[81]

Multi-Task Deep Neural Networks for Natural Language Understanding

Multi-Task Deep Neural Networks for Natural Language Understanding , author=. arXiv preprint arXiv:1901.11504 , year=

work page Pith review arXiv 1901
[82]

Layer Normalization

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

work page internal anchor Pith review Pith/arXiv arXiv
