arxiv: 2205.01068 · v4 · submitted 2022-05-02 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

OPT: Open Pre-trained Transformer Language Models

Susan Zhang , Stephen Roller , Naman Goyal , Mikel Artetxe , Moya Chen , Shuohui Chen , Christopher Dewan , Mona Diab

show 11 more authors

Xian Li Xi Victoria Lin Todor Mihaylov Myle Ott Sam Shleifer Kurt Shuster Daniel Simig Punit Singh Koura Anjali Sridhar Tianlu Wang Luke Zettlemoyer

Authors on Pith no claims yet

Pith reviewed 2026-05-10 20:48 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords large language modelsopen sourcepre-trained transformersGPT-3carbon footprintdecoder-only modelsfew-shot learningmodel release

0 comments

The pith

A suite of open decoder-only transformer models up to 175B parameters matches GPT-3 performance while using only one-seventh the carbon footprint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a collection of pre-trained language models called OPT that range in size from 125 million to 175 billion parameters. These models are made available with full weights and training code to allow broad research access. The central demonstration is that the largest version performs similarly to the closed GPT-3 model but requires substantially less energy and emissions to train. This matters because it lowers the barrier for studying and improving large language models beyond a small number of organizations with massive resources.

Core claim

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We

What carries the argument

The OPT suite, a collection of openly released decoder-only pre-trained transformer language models ranging from 125M to 175B parameters that includes full weights, training code, and infrastructure logs.

If this is right

Researchers can directly access and modify the full model weights instead of relying on restricted APIs.
Large-scale language model development becomes feasible with substantially lower carbon emissions.
The released code allows experimentation across the full range of model sizes from 125M to 175B parameters.
Infrastructure logs provide concrete details on challenges encountered during training of these models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread access to the weights could enable more groups to test safety and bias mitigation techniques on models of this scale.
Lower training costs may support repeated fine-tuning cycles that were previously impractical for non-industry labs.
The open release creates a direct path for third parties to verify the reported performance and emissions numbers.

Load-bearing premise

That the benchmarks and evaluation protocols used to establish comparability between OPT-175B and GPT-3 are fair, comprehensive, and not affected by differences in training data or optimization details.

What would settle it

An independent run of OPT-175B on the same zero- and few-shot benchmarks as GPT-3 that shows a clear performance gap, or a recalculation of training emissions that exceeds one-seventh of the GPT-3 figure.

read the original abstract

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the OPT suite of decoder-only pre-trained transformer language models with sizes ranging from 125M to 175B parameters. The central claims are that OPT-175B achieves performance comparable to GPT-3 across zero- and few-shot tasks while requiring only 1/7th the carbon footprint to develop, and that the models, training logbook, and code will be released to enable broader research.

Significance. If the performance and carbon claims hold under transparent evaluation, the work is significant for lowering barriers to studying large language models by providing open weights and infrastructure details. The release of code and logs supports reproducibility, and the carbon reduction highlights practical efficiencies in training at scale.

major comments (2)

Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.
Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted in the weakest assumption.

minor comments (2)

The logbook release is a strength for transparency; however, it would benefit from an index or summary table mapping challenges to specific training stages or model sizes.
Notation for model sizes (e.g., OPT-175B) is clear, but ensure all hyperparameter tables in the appendix explicitly list learning rate schedules, batch sizes, and data mixtures for each scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the planned revisions.

read point-by-point responses

Referee: Abstract and carbon footprint section: The headline claim that OPT-175B requires only 1/7th the carbon footprint of GPT-3 depends on an external third-party estimate for GPT-3 emissions. The manuscript must include a side-by-side table or explicit comparison of all assumptions (TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw) used for both models; without this, the scalar ratio is not robust or independently verifiable from the OPT measurements alone.

Authors: We agree that a transparent comparison of assumptions is necessary to support the carbon claim. In the revised manuscript we will add a side-by-side table in the carbon footprint section that explicitly lists TDP, PUE, hardware utilization, effective FLOPs per token, and cluster power draw for both OPT-175B (our measurements) and the GPT-3 estimate. This will allow readers to inspect the basis of the 1/7th ratio directly. revision: yes
Referee: Evaluation section (results tables): The statement of comparability to GPT-3 is load-bearing but presented without error bars, run-to-run variance, or a complete list of tasks and exact scores in a single consolidated table. This makes it difficult to assess whether differences are statistically meaningful or affected by training data/optimization details, as noted in the weakest assumption.

Authors: We acknowledge that a consolidated table improves clarity. Due to the prohibitive cost of training at this scale we performed only a single run for OPT-175B and therefore cannot supply run-to-run variance or error bars. We will revise the evaluation section to present all zero- and few-shot results in one consolidated table with exact scores for every task, and we will add explicit text noting the single-run limitation and its implications for statistical comparison. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training results and external benchmark comparisons

full rationale

The paper presents direct empirical results from training decoder-only transformers (125M to 175B parameters) and evaluates them on standard zero- and few-shot benchmarks against GPT-3. The carbon-footprint comparison (1/7th) relies on an external third-party estimate for GPT-3 rather than any self-derived quantity or fitted parameter. No equations, ansatzes, uniqueness theorems, or self-citations reduce claims to inputs by construction; the derivation chain consists of reported training runs and external references, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This review is based solely on the abstract; full methods, hyperparameters, data, and evaluation details are unavailable. No free parameters, axioms, or invented entities can be audited from the provided text.

pith-pipeline@v0.9.0 · 5503 in / 1023 out tokens · 37177 ms · 2026-05-10T20:48:21.628279+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Code as Policies: Language Model Programs for Embodied Control
cs.RO 2022-09 accept novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
cs.LG 2026-05 conditional novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
cs.LG 2026-05 unverdicted novelty 7.0

Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 accept novelty 7.0

Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 unverdicted novelty 7.0

Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
cs.LG 2026-05 unverdicted novelty 7.0

PACZero achieves zero mutual information privacy for LLM fine-tuning via sign-quantized zeroth-order gradients, delivering near-non-private accuracy on SST-2 and SQuAD at I=0.
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
cs.CL 2026-05 unverdicted novelty 7.0

MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 unverdicted novelty 7.0

Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
cs.PF 2026-04 unverdicted novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
cs.AR 2026-04 unverdicted novelty 7.0

A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
cs.AR 2026-04 conditional novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
On the Invariants of Softmax Attention
cs.LG 2026-04 unverdicted novelty 7.0

Softmax attention has algebraic invariants including zero-sum rows and head-dimension rank limits, plus consistent variance spread in language models attributed to key incoherence.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG 2023-09 conditional novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
Steering Language Models With Activation Engineering
cs.CL 2023-08 unverdicted novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
cs.AI 2023-04 accept novelty 7.0

LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
cs.LG 2022-10 unverdicted novelty 7.0

GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
Instructions Shape Production of Language, not Processing
cs.CL 2026-05 unverdicted novelty 6.0

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
cs.LG 2026-05 unverdicted novelty 6.0

SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
cs.LG 2026-05 unverdicted novelty 6.0

DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
cs.CR 2026-05 conditional novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective
cs.LG 2026-05 unverdicted novelty 6.0

Zeroth-order SGD learning dynamics are governed by a random low-dimensional projection of the empirical NTK whose approximation error scales with model output dimension, not parameter count.
Gated Subspace Inference for Transformer Acceleration
cs.LG 2026-05 unverdicted novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
cs.CR 2026-05 unverdicted novelty 6.0

BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
cs.DC 2026-04 unverdicted novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

DUAL-BLADE uses a dual-path KV-cache framework with NVMe-direct access to reduce prefill and decode latency by up to 33% and 42% while improving SSD utilization 2.2x under tight memory budgets.
AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
cs.AR 2026-04 unverdicted novelty 6.0

AHASD is a new asynchronous heterogeneous architecture for mobile NPU-PIM systems that enables efficient adaptive speculative decoding for LLMs by decoupling drafting and verification with specialized controls and har...
Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
cs.GR 2026-04 unverdicted novelty 6.0

An animator-centric skeleton generation method that uses semantic-aware tokenization and a learnable density interval module to produce controllable, high-quality skeletons on complex 3D meshes.
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
cs.CL 2026-04 unverdicted novelty 6.0

SHADE adaptively combines coverage and spectral signals to estimate semantic alphabet size from few LLM samples, yielding better performance than baselines in low-sample regimes for alphabet estimation and QA error detection.
Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
cs.LG 2026-04 unverdicted novelty 6.0

AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...
Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL 2026-04 unverdicted novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Federated User Behavior Modeling for Privacy-Preserving LLM Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

SF-UBM enables privacy-preserving cross-domain LLM recommendation by federating semantic item representations, distilling domain knowledge, and aligning preferences into LLM soft prompts.
Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

CoUR uses LLMs for efficient RL reward design through uncertainty quantification and similarity selection, achieving better performance and lower evaluation costs on IsaacGym and Bidexterous Manipulation benchmarks.
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
cs.NE 2026-04 unverdicted novelty 6.0

BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Adaptive Stealing improves watermark theft efficiency from LLMs via Position-Based Seal Construction and Adaptive Selection modules that dynamically choose optimal attack perspectives.
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
cs.CV 2026-04 unverdicted novelty 6.0

LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM
cs.AR 2026-04 conditional novelty 6.0

LOCALUT delivers 1.82x geometric mean speedup for quantized DNN inference on real UPMEM DRAM-PIM devices by using operation-packed LUTs with canonicalization, reordering, and slice streaming.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
cs.LG 2026-03 conditional novelty 6.0

HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
cs.CL 2023-10 unverdicted novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

Reference graph

Works this paper leans on

294 extracted references · 200 canonical work pages · cited by 95 Pith papers · 31 internal anchors

[1]

Naman Goyal and Cynthia Gao and Vishrav Chaudhary and Peng. The. CoRR , volume =. 2021 , url =. 2106.03193 , timestamp =

work page arXiv 2021
[2]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

WinoGrande: An Adversarial Winograd Schema Challenge at Scale , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2020 , month=. doi:10.1609/aaai.v34i05.6399 , number=

work page doi:10.1609/aaai.v34i05.6399 2020
[3]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

PIQA: Reasoning about Physical Commonsense in Natural Language , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2020 , month=. doi:10.1609/aaai.v34i05.6239 , number=

work page doi:10.1609/aaai.v34i05.6239 2020
[4]

arXiv preprint arXiv:1805.12471 , year=

Neural Network Acceptability Judgments , author=. arXiv preprint 1805.12471 , year=

work page arXiv
[5]

Biochimica et Biophysica Acta (BBA)-Protein Structure , volume=

Comparison of the predicted and observed secondary structure of T4 phage lysozyme , author=. Biochimica et Biophysica Acta (BBA)-Protein Structure , volume=. 1975 , publisher=

1975
[6]

Character-level convolutional networks for text classification , author=

work page
[7]

Quantifying the Carbon Emissions of Machine Learning

Quantifying the Carbon Emissions of Machine Learning , author=. arXiv preprint arXiv:1910.09700 , year=

work page internal anchor Pith review arXiv 1910
[8]

arXiv preprint arXiv:2003.11942 , year=

Towards backward-compatible representation learning , author=. arXiv preprint arXiv:2003.11942 , year=

work page arXiv 2003
[9]

2020 , eprint=

Training with Quantization Noise for Extreme Model Compression , author=. 2020 , eprint=

2020
[10]

International Conference on Learning Representations , year=

What do you learn from context? Probing for sentence structure in contextualized word representations , author=. International Conference on Learning Representations , year=

work page
[11]

BERT Rediscovers the Classical NLP Pipeline , publisher =

BERT rediscovers the classical NLP pipeline , author=. arXiv preprint arXiv:1905.05950 , year=

work page arXiv 1905
[12]

Multi-task sequence to sequence learning , author=

work page
[13]

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages=

work page
[14]

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

Bam! born-again multi-task networks for natural language understanding , author=. arXiv preprint arXiv:1907.04829 , year=

work page Pith review arXiv 1907
[15]

Machine learning , volume=

Multitask learning , author=. Machine learning , volume=. 1997 , publisher=

1997
[16]

An Overview of Multi-Task Learning in Deep Neural Networks

An overview of multi-task learning in deep neural networks , author=. arXiv preprint arXiv:1706.05098 , year=

work page internal anchor Pith review arXiv
[17]

Proceedings of the 42nd annual meeting on Association for Computational Linguistics , pages=

A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts , author=. Proceedings of the 42nd annual meeting on Association for Computational Linguistics , pages=. 2004 , organization=

2004
[18]

Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media , pages=

A survey on hate speech detection using natural language processing , author=. Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media , pages=
[19]

Distilling the Knowledge in a Neural Network

Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Parameter-Efficient Transfer Learning for NLP , author=
[21]

Proceedings of the 13th International Workshop on Semantic Evaluation , pages=

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) , author=. Proceedings of the 13th International Workshop on Semantic Evaluation , pages=

work page 2019
[22]

Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas , booktitle=

work page
[23]

The Natural Language Decathlon: Multitask Learning as Question Answering

The Natural Language Decathlon: Multitask Learning as Question Answering , author=. arXiv preprint arXiv:1806.08730 , year=

work page Pith review arXiv
[24]

Proceedings of the 25th international conference on Machine learning , pages=

A unified architecture for natural language processing: Deep neural networks with multitask learning , author=. Proceedings of the 25th international conference on Machine learning , pages=. 2008 , organization=

work page 2008
[25]

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

Humor recognition and humor anchor extraction , author=. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2015
[26]

Proceedings of the conference on empirical methods in natural language processing , pages=

Revisiting readability: A unified framework for predicting text quality , author=. Proceedings of the conference on empirical methods in natural language processing , pages=. 2008 , organization=

work page 2008
[27]

Weld and Luke Zettlemoyer and Omer Levy , year=

Mandar Joshi and Danqi Chen and Yinhan Liu and Daniel S. Weld and Luke Zettlemoyer and Omer Levy , year=

work page
[28]

Zhang, Zhengyan and Han, Xu and Liu, Zhiyuan and Jiang, Xin and Sun, Maosong and Liu, Qun , booktitle=acl, year=

work page
[29]

Yu Stephanie Sun and Shuohuan Wang and Yukun Li and Shikun Feng and Xuyi Chen and Han Zhang and Xinlun Tian and Danxiang Zhu and Hao Tian and Hua Wu , journal=

work page
[30]

International Conference on Learning Representations , year=

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning , author=. International Conference on Learning Representations , year=

work page
[31]

Advances in neural information processing systems , pages=

Skip-thought vectors , author=. Advances in neural information processing systems , pages=
[32]

Learning Distributed Representations of Sentences from Unlabelled Data

Hill, Felix and Cho, Kyunghyun and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016. doi:10.18653/v1/N16-1162

work page doi:10.18653/v1/n16-1162 2016
[33]

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , booktitle =

Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Lo\". Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , booktitle =. 2017 , address =

2017
[34]

arXiv preprint arXiv:1903.05987 , year=

To tune or not to tune? adapting pretrained representations to diverse tasks , author=. arXiv preprint arXiv:1903.05987 , year=

work page arXiv 1903
[35]

Uniﬁed language model pre- training for natural language understanding and gen- eration

Unified Language Model Pre-training for Natural Language Understanding and Generation , author=. arXiv preprint arXiv:1905.03197 , year=

work page arXiv 1905
[36]

Chan, William and Kitaev, Nikita and Guu, Kelvin and Stern, Mitchell and Uszkoreit, Jakob , journal=
[37]

Learned in translation: Contextualized word vectors , author=
[38]

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle=naacl, year=
[39]

Xlnet: Generalized autoregressive pretraining for language understanding

XLNet: Generalized Autoregressive Pretraining for Language Understanding , author=. arXiv preprint arXiv:1906.08237 , year=

work page arXiv 1906
[41]

Cloze-driven Pretraining of Self-attention Networks

Cloze-driven pretraining of self-attention networks , author=. arXiv preprint arXiv:1903.07785 , year=

work page Pith review arXiv 1903
[42]

International Conference on Learning Representations , year=

Adaptive Input Representations for Neural Language Modeling , author=. International Conference on Learning Representations , year=
[43]

Generating Long Sequences with Sparse Transformers

Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[44]

OpenWebText Corpus , author=
[45]

A Fair Comparison Study of XLNet and BERT with Large Models , author=
[46]

In: International Conference on Learning Representations (ICLR) 2020 (2020).https://arxiv.org/abs/1904.00962

Reducing BERT Pre-Training Time from 3 Days to 76 Minutes , author=. arXiv preprint arXiv:1904.00962 , year=

work page arXiv 1904
[47]

arXiv preprint arXiv:1404.5997 , year=

One weird trick for parallelizing convolutional neural networks , author=. arXiv preprint arXiv:1404.5997 , year=

work page arXiv
[48]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=
[49]

First Quora Dataset Release: Question Pairs , author=
[50]

Sara Bergman , howpublished=
[51]

2017 , booktitle =

Hamborg, Felix and Meuschke, Norman and Breitinger, Corinna and Gipp, Bela , title =. 2017 , booktitle =

2017
[52]

Defending against neural fake news.arXiv preprint arXiv:1905.12616,

Defending Against Neural Fake News , author=. arXiv preprint arXiv:1905.12616 , year=

work page arXiv 1905
[53]

, booktitle=iclr, year=

Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R. , booktitle=iclr, year=
[54]

and Schwenk, Holger and Stoyanov, Veselin

Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin. XNLI: Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018

2018
[55]

Bowman , journal=

Alex Wang and Yada Pruksachatkun and Nikita Nangia and Amanpreet Singh and Julian Michael and Felix Hill and Omer Levy and Samuel R. Bowman , journal=. Super
[56]

Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina , booktitle=
[57]

De Marneffe, Marie-Catherine and Simons, Mandy and Tonhauser, Judith , note=
[58]

2011 AAAI Spring Symposium Series , year=

Choice of plausible alternatives: An evaluation of commonsense causal reasoning , author=. 2011 AAAI Spring Symposium Series , year=

work page 2011
[59]

Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

Looking beyond the surface: A challenge set for reading comprehension over multiple sentences , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

work page 2018
[60]

Sheng Zhang and Xiaodong Liu and Jingjing Liu and Jianfeng Gao and Kevin Duh and Benjamin Van Durme , journal=

work page
[61]

Dagan, Ido and Glickman, Oren and Magnini, Bernardo , booktitle=. The. 2006 , publisher=

work page 2006
[62]

The second

Bar Haim, Roy and Dagan, Ido and Dolan, Bill and Ferro, Lisa and Giampiccolo, Danilo and Magnini, Bernardo and Szpektor, Idan , year=. The second

work page
[63]

The third

Giampiccolo, Danilo and Magnini, Bernardo and Dagan, Ido and Dolan, Bill , booktitle=. The third. 2007 , organization=

work page 2007
[64]

The Fifth

Bentivogli, Luisa and Dagan, Ido and Dang, Hoa Trang and Giampiccolo, Danilo and Magnini, Bernardo , booktitle=. The Fifth
[65]

Pilehvar, Mohammad Taher and Camacho-Collados, Jose , booktitle=
[66]

Proceedings of NAACL-HLT , year=

Gender Bias in Coreference Resolution , author=. Proceedings of NAACL-HLT , year=
[67]

Proceedings of EMNLP , year=

Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation , author=. Proceedings of EMNLP , year=
[68]

Levesque, Hector J and Davis, Ernest and Morgenstern, Leora , booktitle=. The
[69]

Automatic Differentiation in

Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam , booktitle=. Automatic Differentiation in
[70]

Neural Machine Translation of Rare Words with Subword Units , author=
[71]

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding , author=. arXiv preprint arXiv:1904.09482 , year=

work page Pith review arXiv 1904
[72]

A surprisingly robust trick for Winograd schema challenge.arXiv preprint arXiv:1905.06290,

A Surprisingly Robust Trick for Winograd Schema Challenge , author=. arXiv preprint arXiv:1905.06290 , year=

work page arXiv 1905
[73]

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks , author=. arXiv preprint arXiv:1811.01088 , year=

work page Pith review arXiv
[74]

2017 , Note =

Honnibal, Matthew and Montani, Ines , TITLE =. 2017 , Note =

2017
[75]

International Conference on Learning Representations , year=

Mixed Precision Training , author=. International Conference on Learning Representations , year=
[76]

Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli , booktitle = naacl_demo, year =
[78]

How to Fine-Tune BERT for Text Classification?https://arxiv.org/ pdf/1905.05583

How to Fine-Tune BERT for Text Classification? , author=. arXiv preprint arXiv:1905.05583 , year=

work page arXiv 1905
[79]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , author =
[80]

5th Workshop on Energy Efficient Machine Learning and Cognitive Computing , year=

Q8bert: Quantized 8bit bert , author=. 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing , year=
[81]

Multi-Task Deep Neural Networks for Natural Language Understanding

Multi-Task Deep Neural Networks for Natural Language Understanding , author=. arXiv preprint arXiv:1901.11504 , year=

work page Pith review arXiv 1901
[82]

Layer Normalization

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

work page internal anchor Pith review Pith/arXiv arXiv

Showing first 80 references.