OPT: Open Pre-trained Transformer Language Models
Pith reviewed 2026-05-10 20:48 UTC · model grok-4.3
The pith
A suite of open decoder-only transformer models of up to 175B parameters performs comparably to GPT-3 while requiring only one-seventh of the carbon footprint to develop.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
What carries the argument
The OPT suite, a collection of openly released decoder-only pre-trained transformer language models ranging from 125M to 175B parameters that includes full weights, training code, and infrastructure logs.
If this is right
- Researchers can directly access and modify the full model weights instead of relying on restricted APIs.
- Large-scale language model development becomes feasible with substantially lower carbon emissions.
- The released code allows experimentation across the full range of model sizes from 125M to 175B parameters.
- Infrastructure logs provide concrete details on challenges encountered during training of these models.
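As a rough illustration of what the 125M-to-175B range means architecturally, the sketch below estimates parameter counts for a decoder-only transformer. It assumes a feed-forward width of four times the hidden size, tied input and output embeddings, and learned positional embeddings, and it ignores biases and layer-norm parameters. The layer counts, hidden sizes, vocabulary size, and context length are assumptions recalled from the published OPT configurations, not values stated on this page; check them against the released code before relying on them.

```python
def approx_decoder_params(num_layers, d_model, vocab_size, max_positions):
    """Approximate parameter count of a decoder-only transformer.

    Per layer: 4 * d^2 for the attention projections (Q, K, V, output)
    plus 8 * d^2 for a feed-forward block of width 4 * d.
    Biases and layer-norm weights are ignored.
    """
    per_layer = 12 * d_model ** 2
    # Token embeddings (assumed tied with the output layer) plus
    # learned positional embeddings.
    embeddings = (vocab_size + max_positions) * d_model
    return num_layers * per_layer + embeddings


# Assumed configurations (not taken from this page): name -> (layers, d_model).
ASSUMED_CONFIGS = {
    "smallest": (12, 768),
    "largest": (96, 12288),
}
ASSUMED_VOCAB = 50272       # assumption
ASSUMED_POSITIONS = 2048    # assumption

for name, (layers, width) in ASSUMED_CONFIGS.items():
    count = approx_decoder_params(layers, width, ASSUMED_VOCAB, ASSUMED_POSITIONS)
    print(f"{name}: {count / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the nominal 125M and 175B sizes, which suggests the suite scales mainly by depth and hidden width.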
Where Pith is reading between the lines
- Widespread access to the weights could enable more groups to test safety and bias mitigation techniques on models of this scale.
- Lower training costs may support repeated fine-tuning cycles that were previously impractical for non-industry labs.
- The open release creates a direct path for third parties to verify the reported performance and emissions numbers.
Load-bearing premise
That the benchmarks and evaluation protocols used to establish comparability between OPT-175B and GPT-3 are fair, comprehensive, and not affected by differences in training data or optimization details.
What would settle it
An independent run of OPT-175B on the same zero- and few-shot benchmarks as GPT-3 that shows a clear performance gap, or a recalculation of training emissions that exceeds one-seventh of the GPT-3 figure.
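A recalculation of this kind usually reduces to a few multiplications. The following is a minimal sketch, not the accounting used in the paper: the TrainingRun fields and every number in it are hypothetical placeholders, chosen only to show how assumptions such as PUE or utilization move the final ratio.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingRun:
    """Inputs to a back-of-the-envelope training-emissions estimate."""
    num_gpus: int            # accelerators used
    hours: float             # wall-clock training time in hours
    gpu_power_watts: float   # assumed peak draw per accelerator (e.g. TDP)
    utilization: float       # fraction of peak power actually drawn, 0..1
    pue: float               # data-center power usage effectiveness, >= 1
    kg_co2e_per_kwh: float   # grid carbon intensity


def energy_kwh(run):
    """Total facility energy in kWh, including PUE overhead."""
    watts = run.num_gpus * run.gpu_power_watts * run.utilization * run.pue
    return watts * run.hours / 1000.0


def emissions_tonnes(run):
    """Estimated emissions in tonnes of CO2-equivalent."""
    return energy_kwh(run) * run.kg_co2e_per_kwh / 1000.0


def emissions_ratio(run_a, run_b):
    """Emissions of run_a relative to run_b."""
    return emissions_tonnes(run_a) / emissions_tonnes(run_b)


# Hypothetical placeholder values, not measurements from any real run.
baseline = TrainingRun(num_gpus=100, hours=100.0, gpu_power_watts=300.0,
                       utilization=0.8, pue=1.2, kg_co2e_per_kwh=0.4)
# Same run with only the PUE assumption changed.
alternative = replace(baseline, pue=1.1)

print(emissions_tonnes(baseline))
print(emissions_ratio(alternative, baseline))
```

Because the estimate is a plain product, any single assumption scales the result linearly, so a ratio between two models is only as reliable as the least certain input on either side.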
Original abstract
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the OPT suite of decoder-only pre-trained transformer language models with sizes ranging from 125M to 175B parameters. The central claims are that OPT-175B achieves performance comparable to GPT-3 across zero- and few-shot tasks while requiring only 1/7th the carbon footprint to develop, and that the models, training logbook, and code will be released to enable broader research.
Significance. If the performance and carbon claims hold under transparent evaluation, the work is significant for lowering barriers to studying large language models by providing open weights and infrastructure details. The release of code and logs supports reproducibility, and the carbon reduction highlights practical efficiencies in training at scale.
major comments (2)
- Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.
- Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted in the weakest assumption.
minor comments (2)
- The logbook release is a strength for transparency; however, it would benefit from an index or summary table mapping challenges to specific training stages or model sizes.
- Notation for model sizes (e.g., OPT-175B) is clear, but ensure all hyperparameter tables in the appendix explicitly list learning rate schedules, batch sizes, and data mixtures for each scale.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the planned revisions.
- Referee: Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.
  Authors: We agree that a transparent comparison of assumptions is necessary to support the carbon claim. In the revised manuscript we will add a side-by-side table in the carbon footprint section that explicitly lists TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw for both OPT-175B (our measurements) and the GPT-3 estimate. This will allow readers to inspect the basis of the 1/7th ratio directly.
  Revision: yes
- Referee: Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted in the weakest assumption.
  Authors: We acknowledge that a consolidated table improves clarity. Due to the prohibitive cost of training at this scale we performed only a single run for OPT-175B and therefore cannot supply run-to-run variance or error bars. We will revise the evaluation section to present all zero- and few-shot results in one consolidated table with exact scores for every task, and we will add explicit text noting the single-run limitation and its implications for statistical comparison.
  Revision: partial
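A single training run rules out run-to-run variance, but uncertainty from the finite size of each evaluation set can still be estimated. The sketch below shows one common option, a paired bootstrap over per-example correctness. It is offered as an assumption about how such intervals could be computed, not as the authors' planned analysis. The function name and inputs are hypothetical, and the method needs per-example results for both models on the same items, which may not be available for GPT-3.

```python
import random


def bootstrap_accuracy_diff(correct_a, correct_b, num_resamples=2000,
                            alpha=0.05, seed=0):
    """Paired bootstrap interval for accuracy(A) - accuracy(B).

    correct_a and correct_b are equal-length sequences of 0/1 (or bool)
    flags marking whether each model answered the same example correctly.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(num_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(int(correct_a[i]) - int(correct_b[i]) for i in indices) / n
        diffs.append(diff)

    diffs.sort()
    # Percentile interval over the resampled differences.
    lower = diffs[int((alpha / 2) * num_resamples)]
    upper = diffs[max(int((1 - alpha / 2) * num_resamples) - 1, 0)]
    observed = (sum(map(int, correct_a)) - sum(map(int, correct_b))) / n
    return observed, lower, upper
```

An interval that excludes zero suggests a difference larger than evaluation-set sampling noise; it says nothing about variance across training seeds.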
Circularity Check
No circularity: empirical training results and external benchmark comparisons
The paper presents direct empirical results from training decoder-only transformers (125M to 175B parameters) and evaluates them on standard zero- and few-shot benchmarks against GPT-3. The carbon-footprint comparison (1/7th) relies on an external third-party estimate for GPT-3 rather than any self-derived quantity or fitted parameter. No equations, ansatzes, uniqueness theorems, or self-citations reduce claims to inputs by construction; the derivation chain consists of reported training runs and external references, remaining self-contained.
Forward citations
Cited by 60 Pith papers
- ORPO: Monolithic Preference Optimization without Reference Model
  ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
- Instruction Tuning with GPT-4
  GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
- Code as Policies: Language Model Programs for Embodied Control
  Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
- Provable Joint Decontamination for Benchmarking Multiple Large Language Models
  JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
- BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
  BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
- Modality-Decoupled Online Recursive Editing
  M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
- Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
  A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
  Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
  Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
- When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
  Rank-1 activation steering is often cheap when prompt-boundary alignment guides budgeted search and concept granularity diagnoses directional stability, with the GRACE framework reducing trials to 95% utility by 39.8%...
- When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
  Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
- PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
  PACZero achieves zero mutual information privacy for LLM fine-tuning via sign-quantized zeroth-order gradients, delivering near-non-private accuracy on SST-2 and SQuAD at I=0.
- MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
  MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
  A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
  Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
- HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
  HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
- From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
  A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
- A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
  ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
- On the Invariants of Softmax Attention
  Softmax attention has algebraic invariants including zero-sum rows and head-dimension rank limits, plus consistent variance spread in language models attributed to key incoherence.
- Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
  Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
- Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
  Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent g...
- HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
  HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.
- DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
  DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
- PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
  PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...
- All is Not Lost: LLM Recovery without Checkpoints
  CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding st...
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
- Federated Co-tuning Framework for Large and Small Language Models
  FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Topic-Based Watermarks for Large Language Models
  A topic-guided watermarking scheme partitions the LLM vocabulary into topic-aligned token subsets and green-lists relevant tokens based on the input prompt to embed detectable marks while preserving text quality and i...
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Detecting Pretraining Data from Large Language Models
  Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
  Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
- EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers
  EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.
- Efficient Memory Management for Large Language Model Serving with PagedAttention
  PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
- Steering Language Models With Activation Engineering
  Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
- The Curse of Recursion: Training on Generated Data Makes Models Forget
  Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.
- QLoRA: Efficient Finetuning of Quantized LLMs
  QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
- VideoChat: Chat-Centric Video Understanding
  VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
- LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
  LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
- Visual Instruction Tuning
  LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- Eliciting Latent Predictions from Transformers with the Tuned Lens
  Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
  LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- Quantifying Memorization Across Neural Language Models
  Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
- TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting
  TimeGuard employs channel-wise pool training initialized with time-aware criteria and distance-regularized loss selection to defend time series forecasting against backdoor attacks, improving robustness by 1.96x while...
- Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies
  Self-training restructures language by amplifying surface markers and collapsing deep syntax according to structural depth rather than frequency, as evidenced by correlations across multiple models and a human fine-tu...
- DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models
  DP-SelFT improves the privacy-utility trade-off for LLM fine-tuning by selecting robust layer subsets via DP synthetic data and perturbation-matched evaluation.
- Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning
  Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
- Instructions Shape Production of Language, not Processing
  Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
- Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
  ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.
- UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
  A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
- SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
  SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
- DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
  DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
- On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
  An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
- OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
  OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
Reference graph
Works this paper leans on
- [1]
-
[2]
Proceedings of the AAAI Conference on Artificial Intelligence , author=
WinoGrande: An Adversarial Winograd Schema Challenge at Scale , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2020 , month=. doi:10.1609/aaai.v34i05.6399 , number=
-
[3]
PIQA: Reasoning about physical commonsense in natural language
PIQA: Reasoning about Physical Commonsense in Natural Language , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2020 , month=. doi:10.1609/aaai.v34i05.6239 , number=
-
[4]
Neural Network Ac- ceptability Judgments,
Neural Network Acceptability Judgments , author=. arXiv preprint 1805.12471 , year=
-
[5]
Biochimica et Biophysica Acta (BBA)-Protein Structure , volume=
Comparison of the predicted and observed secondary structure of T4 phage lysozyme , author=. Biochimica et Biophysica Acta (BBA)-Protein Structure , volume=. 1975 , publisher=
work page 1975
-
[6]
Character-level convolutional networks for text classification , author=
-
[7]
Quantifying the Carbon Emissions of Machine Learning
Quantifying the Carbon Emissions of Machine Learning , author=. arXiv preprint arXiv:1910.09700 , year=
work page internal anchor Pith review arXiv 1910
-
[8]
arXiv preprint arXiv:2003.11942 , year=
Towards backward-compatible representation learning , author=. arXiv preprint arXiv:2003.11942 , year=
-
[9]
Training with Quantization Noise for Extreme Model Compression , author=. 2020 , eprint=
work page 2020
-
[10]
International Conference on Learning Representations , year=
What do you learn from context? Probing for sentence structure in contextualized word representations , author=. International Conference on Learning Representations , year=
-
[11]
arXiv preprint arXiv:1905.05950 , year=
BERT rediscovers the classical NLP pipeline , author=. arXiv preprint arXiv:1905.05950 , year=
-
[12]
Multi-task sequence to sequence learning , author=
-
[13]
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=
Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=
-
[14]
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
Bam! born-again multi-task networks for natural language understanding , author=. arXiv preprint arXiv:1907.04829 , year=
work page Pith review arXiv 1907
-
[15]
Multitask learning , author=. Machine learning , volume=. 1997 , publisher=
work page 1997
-
[16]
An Overview of Multi-Task Learning in Deep Neural Networks
An overview of multi-task learning in deep neural networks , author=. arXiv preprint arXiv:1706.05098 , year=
work page internal anchor Pith review arXiv
-
[17]
Proceedings of the 42nd annual meeting on Association for Computational Linguistics , pages=
A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts , author=. Proceedings of the 42nd annual meeting on Association for Computational Linguistics , pages=. 2004 , organization=
work page 2004
-
[18]
A survey on hate speech detection using natural language processing , author=. Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media , pages=
-
[19]
Distilling the Knowledge in a Neural Network
Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Parameter-Efficient Transfer Learning for NLP , author=
-
[21]
Proceedings of the 13th International Workshop on Semantic Evaluation , pages=
SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) , author=. Proceedings of the 13th International Workshop on Semantic Evaluation , pages=
work page 2019
-
[22]
Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas , booktitle=
-
[23]
The Natural Language Decathlon: Multitask Learning as Question Answering
The Natural Language Decathlon: Multitask Learning as Question Answering , author=. arXiv preprint arXiv:1806.08730 , year=
-
[24]
Proceedings of the 25th international conference on Machine learning , pages=
A unified architecture for natural language processing: Deep neural networks with multitask learning , author=. Proceedings of the 25th international conference on Machine learning , pages=. 2008 , organization=
work page 2008
-
[25]
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=
Humor recognition and humor anchor extraction , author=. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=
work page 2015
-
[26]
Proceedings of the conference on empirical methods in natural language processing , pages=
Revisiting readability: A unified framework for predicting text quality , author=. Proceedings of the conference on empirical methods in natural language processing , pages=. 2008 , organization=
work page 2008
-
[27]
Weld and Luke Zettlemoyer and Omer Levy , year=
Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy , year=
-
[28]
Zhang, Zhengyan and Han, Xu and Liu, Zhiyuan and Jiang, Xin and Sun, Maosong and Liu, Qun , booktitle=acl, year=
-
[29]
Yu Stephanie Sun and Shuohuan Wang and Yukun Li and Shikun Feng and Xuyi Chen and Han Zhang and Xinlun Tian and Danxiang Zhu and Hao Tian and Hua Wu , journal=
-
[30]
International Conference on Learning Representations , year=
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning , author=. International Conference on Learning Representations , year=
-
[31]
Advances in neural information processing systems , pages=
Skip-thought vectors , author=. Advances in neural information processing systems , pages=
-
[32]
Learning Distributed Representations of Sentences from Unlabelled Data
Hill, Felix and Cho, Kyunghyun and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016. doi:10.18653/v1/N16-1162
-
[33]
Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo\". Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , booktitle =. 2017 , address =
work page 2017
-
[34]
To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
To tune or not to tune? adapting pretrained representations to diverse tasks , author=. arXiv preprint arXiv:1903.05987 , year=
work page Pith review arXiv 1903
-
[35]
Unified language model pre- training for natural language understanding and gen- eration
Unified Language Model Pre-training for Natural Language Understanding and Generation , author=. arXiv preprint arXiv:1905.03197 , year=
-
[36]
Chan, William and Kitaev, Nikita and Guu, Kelvin and Stern, Mitchell and Uszkoreit, Jakob , journal=
-
[37]
Learned in translation: Contextualized word vectors , author=
-
[38]
Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle=naacl, year=
-
[39]
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding , author=. arXiv preprint arXiv:1906.08237 , year=
work page internal anchor Pith review arXiv 1906
-
[41]
Cloze-driven Pretraining of Self-attention Networks. arXiv preprint arXiv:1903.07785.
-
[42]
Adaptive Input Representations for Neural Language Modeling. International Conference on Learning Representations.
-
[43]
Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
-
[44]
OpenWebText Corpus.
-
[45]
A Fair Comparison Study of XLNet and BERT with Large Models.
-
[46]
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Reducing BERT Pre-Training Time from 3 Days to 76 Minutes. arXiv preprint arXiv:1904.00962.
-
[47]
One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
-
[48]
Decoupled Weight Decay Regularization. International Conference on Learning Representations.
-
[49]
First Quora Dataset Release: Question Pairs.
-
[50]
Sara Bergman.
-
[51]
Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela. 2017.
-
[52]
Defending Against Neural Fake News. arXiv preprint arXiv:1905.12616.
-
[53]
Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. ICLR.
-
[54]
Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin. XNLI: Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
-
[55]
Alex Wang and Yada Pruksachatkun and Nikita Nangia and Amanpreet Singh and Julian Michael and Felix Hill and Omer Levy and Samuel R. Bowman. Super
-
[56]
Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina.
-
[57]
De Marneffe, Marie-Catherine and Simons, Mandy and Tonhauser, Judith.
-
[58]
Choice of plausible alternatives: An evaluation of commonsense causal reasoning. 2011 AAAI Spring Symposium Series.
-
[59]
Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
-
[60]
Sheng Zhang and Xiaodong Liu and Jingjing Liu and Jianfeng Gao and Kevin Duh and Benjamin Van Durme.
-
[61]
Dagan, Ido and Glickman, Oren and Magnini, Bernardo. The. 2006.
-
[62]
Bar Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan. The second
- [63]
- [64]
-
[65]
Pilehvar, Mohammad Taher and Camacho-Collados, Jose.
-
[66]
Gender Bias in Coreference Resolution. Proceedings of NAACL-HLT.
-
[67]
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation. Proceedings of EMNLP.
-
[68]
Levesque, Hector J and Davis, Ernest and Morgenstern, Leora. The
-
[69]
Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam. Automatic Differentiation in
-
[70]
Neural Machine Translation of Rare Words with Subword Units.
-
[71]
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482.
-
[72]
A Surprisingly Robust Trick for Winograd Schema Challenge. arXiv preprint arXiv:1905.06290.
-
[73]
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. arXiv preprint arXiv:1811.01088.
- [74]
-
[75]
Mixed Precision Training. International Conference on Learning Representations.
-
[76]
Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli. NAACL (Demonstrations).
-
[78]
How to Fine-Tune BERT for Text Classification? arXiv preprint arXiv:1905.05583, 2019.
-
[79]
PyTorch: An Imperative Style, High-Performance Deep Learning Library.
-
[80]
Q8BERT: Quantized 8Bit BERT. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing.
-
[81]
Multi-Task Deep Neural Networks for Natural Language Understanding. arXiv preprint arXiv:1901.11504.
-
[82]
Layer normalization. arXiv preprint arXiv:1607.06450.