pith · machine review for the scientific record

arxiv: 2408.00118 · v3 · submitted 2024-07-31 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Gemma 2: Improving Open Language Models at a Practical Size

Abe Friesen, Alanna Walton, Alek Andreev, Alexandre Ramé, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Anand Rao, Anca Dragan, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Anton Tsitsulin, Armand Joulin, Behnam Neyshabur, Ben Bastian, Bilal Piot, Bobak Shahriari, Bo Wu, Brandon Royal, Cassidy Hardin, Charlie Chen, Charline Le Lan, Chintu Kumar, Chris Perry, Christopher A. Choquette-Choo, Chris Welty, Clement Farabet, Danila Sinopalnikov, David Weinberger, Demis Hassabis, Dimple Vijaykumar, Dominika Rogozińska, D. Sculley, Dustin Herbison, Elena Buchatskaya, Eli Collins, Elisa Bandy, Emma Wang, Erica Moreira, Eric Noland, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Gemma Team: Morgane Riviere, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jean-bastien Grill, Jeanine Banks, Jeff Dean, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joelle Barral, Johan Ferret, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Kathleen Kenealy, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Koray Kavukcuoglu, Lars Lowe Sjoesund, Laurent Sifre, Lauren Usui, Lena Heuermann, Léonard Hussenot, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Ludovic Peran, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mateo Wirth, Matt Davidow, Matthew Rahtz, Matthew Watson, Matt Hoffman, Matt Miller, Mat Velloso, Meg Risdal, Mehran Kazemi, Michael Moynihan, Michelle Casbon, Ming Zhang, Minh Giang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nikola Momchev, Nilay Chauhan, Nino Vieillard, Noah Fiedel, Olivier Bachem, Oriol Vinyals, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Peter Liu, Petko Georgiev, Phil Culliton, Phoebe Kirk, Pier Giuseppe Sessa, Piotr Stanczyk, Pouya Tafti, Pradeep Kuppala, Raia Hadsell, Ramona Comanescu, Ramona Merhej, Ravin Kumar, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Robert Dadashi, Ryan Mullins, Sabela Ramos, Samaneh Saadat, Sammy Jerome, Sarah Cogan, Sarah Perrin, Sara Mc Carthy, Sebastian Borgeaud, Sebastian Krause, Sébastien M. R. Arnold, Sertan Girgin, Shantanu Thakoor, Shengyang Dai, Shreya Pathak, Shruti Garg, Shruti Sheth, Slav Petrov, Sue Ronstrom, Surya Bhupatiraju, Susan Chan, Thomas Mesnard, Timothy Jordan, Ting Yu, Tomas Kocisky, Tom Eccles, Tom Hennigan, Tris Warkentin, Tulsee Doshi, Victor Cotruta, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Zoubin Ghahramani

Pith reviewed 2026-05-10 12:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open language models · Gemma 2 · knowledge distillation · local-global attention · group-query attention · transformer architecture · model scaling · performance benchmarks

The pith

Gemma 2 models achieve leading performance at their sizes through interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gemma 2 as an updated family of open language models with 2 billion to 27 billion parameters. It incorporates interleaving of local and global attention layers together with group-query attention in the Transformer backbone, while training the 2B and 9B versions via knowledge distillation rather than standard next-token prediction. These changes produce models that lead their size class on benchmarks and remain competitive with models two to three times larger. A reader would care because the work demonstrates concrete ways to extract more capability from models that fit on everyday hardware and can be released openly.
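The distillation objective used for the smaller models can be illustrated with a toy example: instead of minimizing cross-entropy against a one-hot next-token target, the student minimizes cross-entropy against the teacher's full output distribution. A minimal plain-Python sketch; the four-token vocabulary and all probabilities are invented for illustration, not taken from the paper.

```python
import math

def cross_entropy(target_probs, student_probs):
    """Cross-entropy H(p, q) = -sum_i p_i * log q_i over a vocabulary."""
    return -sum(p * math.log(q) for p, q in zip(target_probs, student_probs) if p > 0)

# Toy 4-token vocabulary; the observed next token is index 2.
one_hot = [0.0, 0.0, 1.0, 0.0]       # standard next-token target
teacher = [0.05, 0.15, 0.70, 0.10]   # teacher's soft distribution (hypothetical)
student = [0.10, 0.20, 0.60, 0.10]   # student's predicted distribution (hypothetical)

nll_loss = cross_entropy(one_hot, student)  # plain next-token prediction loss
kd_loss = cross_entropy(teacher, student)   # distillation: match the teacher's whole distribution
```

The soft target carries a gradient signal on every vocabulary entry, not just the observed token, which is the usual intuition for why distillation helps small models.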

Core claim

The authors establish that applying interleaved local-global attention and group-query attention across the model family, plus knowledge distillation for the 2B and 9B models, yields the best performance at each size and makes the models competitive alternatives to systems that are two to three times larger.

What carries the argument

The central mechanisms are the interleaving of local and global attention patterns within the Transformer layers combined with group-query attention, along with knowledge distillation applied specifically to the 2 billion and 9 billion parameter models.
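Both mechanisms reduce to simple bookkeeping that can be sketched directly: alternating which layers use windowed versus full attention, and mapping many query heads onto fewer shared key/value heads. A minimal illustration; the layer count, alternation pattern, and head counts below are assumptions for the sketch, not Gemma 2's actual configuration.

```python
def layer_attention_pattern(num_layers):
    """Interleave local (sliding-window) and global (full) attention per layer.
    The even/odd assignment here is an assumed pattern for illustration."""
    return ["local" if i % 2 == 0 else "global" for i in range(num_layers)]

def kv_head_mapping(num_query_heads, num_kv_heads):
    """Group-query attention: query heads are partitioned into groups that share
    one KV head each, shrinking the KV cache by num_query_heads / num_kv_heads."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    # map each query head index to the KV head its group shares
    return [q // group_size for q in range(num_query_heads)]

pattern = layer_attention_pattern(8)                          # e.g. local, global, local, ...
mapping = kv_head_mapping(num_query_heads=16, num_kv_heads=4)  # 4 query heads per KV head
```

Local layers bound attention cost by the window size, global layers preserve long-range access, and the KV mapping is what cuts inference-time cache memory.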

If this is right

  • Open models at practical sizes can now substitute for much larger ones in many applications.
  • Hardware with modest memory can host capable language models without major quality loss.
  • Releasing the full range from 2B to 27B parameters widens access to high-performing open systems.
  • The same set of changes can be tested on future model scales to check if the efficiency pattern holds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may encourage other developers to prioritize attention-pattern changes over simply adding parameters when resources are constrained.
  • Wider adoption could shift industry focus toward measuring performance per parameter rather than raw scale alone.
  • If the gains replicate across different training runs, they would support using these modifications as a standard baseline for new open models.

Load-bearing premise

The reported gains in performance come from the listed architectural modifications and the switch to distillation rather than from unreported differences in training data volume, compute budget, or evaluation setup.

What would settle it

A controlled experiment that trains identical model sizes with the same data and compute but removes the local-global interleaving and group-query attention would show whether the performance edge disappears on the same benchmarks.
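The controlled experiment described above amounts to a small factorial grid over the three modifications, holding size, data, and compute fixed in every arm. A sketch of how that grid could be enumerated; the flag names are hypothetical labels, not configuration keys from the paper.

```python
from itertools import product

# The three modifications the review says should be toggled independently.
MODIFICATIONS = ["interleaved_local_global", "group_query_attention", "knowledge_distillation"]

def ablation_arms():
    """Enumerate every on/off combination: 2^3 = 8 arms, from the all-off
    baseline to the full recipe, each trained with identical data and compute."""
    return [dict(zip(MODIFICATIONS, flags))
            for flags in product([False, True], repeat=len(MODIFICATIONS))]

arms = ablation_arms()  # arms[0] is the baseline; arms[-1] is the full recipe
```

Comparing each single-toggle arm against the baseline on the same benchmarks would isolate the contribution the referee report says is currently unsecured.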

read the original abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Gemma 2 family of open language models (2B, 9B, and 27B parameters). It applies known Transformer modifications including interleaving local-global attention and group-query attention, and trains the 2B and 9B models via knowledge distillation rather than next-token prediction. The central claim is that the resulting models achieve the best performance for their size and remain competitive with models 2-3 times larger; all models are released openly.

Significance. If the benchmark results are robust, the work supplies practically useful open models that advance the performance frontier at smaller scales, with the public release of weights enabling reproducibility and downstream research. This is a concrete contribution to accessible LLM development.

major comments (2)
  1. [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.
  2. [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.
minor comments (1)
  1. Ensure all benchmark tables include the exact evaluation protocols, number of runs, and any variance measures so that comparisons to 2–3× larger models can be reproduced.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for your review and the constructive feedback on our Gemma 2 manuscript. We address the major comments point by point below, clarifying the scope of our contributions while noting where revisions can strengthen the presentation.

read point-by-point responses
  1. Referee: [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.

    Authors: We agree that the absence of component-wise ablations with fixed data, tokens, and compute makes it difficult to isolate the contribution of each individual change. The manuscript presents the Gemma 2 models as a practical integration of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants), with the central contribution being the resulting performance at these scales and the public release of the weights. We did not perform the requested ablations, as they fall outside the primary goal of delivering and evaluating the final models. In revision we will add explicit language in Sections 2–3 stating that performance gains reflect the combined system and that controlled ablations remain an avenue for future work. revision: partial

  2. Referee: [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.

    Authors: We acknowledge that qualitative descriptions alone leave open the possibility that data differences contribute to the observed gains. Gemma 2 uses an updated mixture that retains the core web, code, and math sources from Gemma 1 while increasing the proportion of high-quality mathematical and code data. Exact token counts and source proportions cannot be released for proprietary and competitive reasons. In the revised manuscript we will expand the data description in the Results section to include a qualitative comparison with the Gemma 1 mixture and to note that the architectural and distillation choices were applied on top of this updated data regime. revision: partial

standing simulated objections not resolved
  • Exact token counts, source proportions, and quantitative comparison tables for the pretraining data mixture, which cannot be disclosed due to proprietary constraints.

Circularity Check

0 steps flagged

No derivation chain present; empirical model release

full rationale

The paper introduces Gemma 2 models by describing the application of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation) and reports benchmark performance. No equations, predictions, or first-principles derivations are claimed or present in the provided text. All cited methods are external (Beltagy et al., Ainslie et al., Hinton et al.), and results are measured against independent benchmarks with models released openly. The work contains no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard Transformer assumptions plus the effectiveness of the cited modifications; no new entities are postulated and the free parameters are the usual training hyperparameters and data choices that are not enumerated in the abstract.

free parameters (2)
  • model scale choices
    Selection of 2B, 9B, and 27B parameter counts as practical sizes
  • training hyperparameters
    Learning rates, batch sizes, and distillation temperatures not specified in abstract
axioms (2)
  • domain assumption Standard Transformer attention and feed-forward blocks remain effective when modified with local-global interleaving and group-query attention
    Invoked by citing Beltagy et al. and Ainslie et al. without re-derivation
  • domain assumption Knowledge distillation improves smaller models over next-token prediction alone
    Cited from Hinton et al. and applied to 2B/9B variants

pith-pipeline@v0.9.0 · 6321 in / 1432 out tokens · 29269 ms · 2026-05-10T12:06:11.309894+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked Generative Transformer Is What You Need for Image Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

  2. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  3. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  5. SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

    cs.CR 2026-04 unverdicted novelty 8.0

    The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...

  6. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  7. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  8. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  9. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  10. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  11. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  12. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

  13. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  14. Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

    cs.LG 2026-05 unverdicted novelty 7.0

    GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.

  15. Implicit Representations of Grammaticality in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

  16. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  17. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  18. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing ...

  19. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis builds a multilingual benchmark and large preference dataset to train code reward models that score outputs on multiple criteria like correctness, efficiency, and style.

  20. E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...

  21. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  22. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  23. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  24. Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs exhibit a clear preference for Japanese culture when answering open cultural questions, with this bias emerging after supervised fine-tuning rather than during pre-training.

  25. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  26. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  27. MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.

  28. LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

  29. Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

    cs.CL 2026-04 unverdicted novelty 7.0

    PIE prunes CLT features first via FAP and FAP-Synergy to match baseline circuit fidelity at lower feature budgets on IOI and Doc-String tasks, reducing interpretation costs.

  30. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

  31. Response-Aware User Memory Selection for LLM Personalization

    cs.AI 2026-04 unverdicted novelty 7.0

    RUMS selects LLM user memory via mutual information with model outputs to reduce response uncertainty, outperforming similarity-based methods in human alignment and response quality with up to 95% lower cost.

  32. Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering

    cs.IR 2026-04 unverdicted novelty 7.0

    CHR improves medical question answering retrieval by explicitly promoting evidence aligned with a correct hypothesis while penalizing content aligned with a plausible incorrect alternative.

  33. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  34. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  35. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  36. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  37. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  38. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  39. ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

    cs.CL 2026-05 conditional novelty 6.0

    ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to transla...

  40. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  41. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  42. From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 6.0

    MedTPE compresses EHR token sequences by up to 31% via merging common medical token pairs, reducing LLM inference latency 34-63% while maintaining or improving performance on mortality and phenotyping tasks.

  43. Causal Bias Detection in Generative Artifical Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  44. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  45. Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Hi-GaTA is a gated temporal pyramid adapter that aggregates multi-scale video features via text-conditioned cross-attention and gated fusion to enable LLM-based surgical report generation, backed by a new 214-video be...

  46. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  47. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  48. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  49. Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

    cs.CL 2026-05 unverdicted novelty 6.0

    SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

  50. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

  51. Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    stat.ML 2026-05 unverdicted novelty 6.0

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...

  52. CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

    cs.LG 2026-05 unverdicted novelty 6.0

    CuBridge adapts expert CUDA attention kernels via LLM-driven lift-transfer-lower to produce correct, high-performance implementations for new variants across GPUs.

  53. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  54. Conceptors for Semantic Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in mu...
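The soft projection matrices mentioned here follow the classic conceptor construction C = R(R + α⁻²I)⁻¹ from a correlation matrix R of activations. The sketch below is that textbook form under illustrative names and an arbitrary aperture α; the paper's exact recipe may differ.

```python
import numpy as np

def conceptor(X, alpha=10.0):
    """Soft projection matrix C = R (R + alpha^-2 I)^-1 from activation
    samples X (n x d). Eigenvalues of C lie in (0, 1), so C scales each
    principal direction by how strongly the concept occupies it, rather
    than hard-projecting onto a single steering vector."""
    n, d = X.shape
    R = (X.T @ X) / n                            # empirical correlation matrix
    return R @ np.linalg.inv(R + alpha**-2 * np.eye(d))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                    # activations for one concept
C = conceptor(X)
steered = C @ rng.normal(size=8)                 # softly project a hidden state
```

Because conceptors are matrices rather than single vectors, they compose (Boolean-style AND/OR operations exist for them), which is one reading of the "multidimensional, compositional" claim in the summary.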

  55. When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

    cs.CL 2026-05 conditional novelty 6.0

    AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

  56. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  57. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  58. Minimizing Collateral Damage in Activation Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
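The constrained-optimization view in this summary has a simple closed form: minimizing the anisotropic damage δᵀMδ (M the empirical second-moment matrix) subject to a fixed shift dᵀδ along the concept direction gives δ ∝ M⁻¹d via a Lagrange multiplier. The sketch below illustrates that calculation on toy data; the names and setup are assumptions, not the paper's code.

```python
import numpy as np

def min_damage_perturbation(H, d, strength=1.0):
    """Minimize delta^T M delta subject to d^T delta = strength, where M is
    the empirical second-moment matrix of activations H (n x d_model).
    Closed form: delta = strength * M^-1 d / (d^T M^-1 d). Under isotropy
    (M = I) this reduces to the usual scaled steering vector."""
    M = (H.T @ H) / len(H)                     # empirical second-moment matrix
    Minv_d = np.linalg.solve(M, d)
    return strength * Minv_d / (d @ Minv_d)    # satisfies d @ delta = strength

rng = np.random.default_rng(2)
H = rng.normal(size=(500, 6)) * np.array([3.0, 1, 1, 1, 1, 0.3])  # anisotropic
d = np.ones(6)                                 # desired steering direction
delta = min_damage_perturbation(H, d)
```

By construction `delta` achieves the same shift along `d` as the naive isotropic choice `d / (d @ d)` but with no greater damage δᵀMδ, since it routes the perturbation through low-variance directions.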

  59. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

  60. MUDY: Multi-Granular Dynamic Candidate Contextualization for Unsupervised Keyphrase Extraction

    cs.IR 2026-05 unverdicted novelty 6.0

    MUDY improves unsupervised keyphrase extraction by combining prompt-based scoring with candidate-aware weighting and self-attention-based multi-granular scoring to capture both local and global contextual salience, ou...

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 120 Pith papers · 27 internal anchors

  1. [2]

    Agarwal, N

    R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  2. [3]

    Llama 3 model card, 2024

    AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [5]

    Almazrouei, H

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models, 2023

  4. [8]

    Barham, A

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ml, 2022

  5. [15]

    Chiang, L

    W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

  6. [18]

    Gemini: A family of highly capable multimodal models, 2023

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  7. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

    Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

  8. [20]

    Gemma: Open models based on gemini research and technology, 2024

    Gemma Team. Gemma: Open models based on gemini research and technology, 2024

  9. [21]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  10. [26]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023

  11. [27]

    Kahng, I

    M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif, K. Kallarackal, M. Chang, M. Terry, and L. Dixon. Llm comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524

  12. [28]

    Evaluating language-model agents on realistic autonomous tasks

    M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, and P. Christiano. Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671

  13. [32]

    Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024. URL https://arxiv.org/abs/2401.03315

  14. [34]

    Personal Communication, 2024

    Macknight, Aung, and Gomes. Personal Communication, 2024

  15. [35]

    Towards agile text classifiers for everyone, 2023

    M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate, N. Thain, A. Yuan, T. Bolukbasi, and L. Dixon. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541

  16. [37]

    Phuong, M

    M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024

  17. [38]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  18. [40]

    A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024

  19. [41]

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551--564, 2021

  20. [42]

    Roberts, H

    A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1--8, 2023

  21. [45]

    Shevlane, S

    T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324

  22. [47]

    Suzgun, N

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022

  23. [48]

    Qwen Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/

  24. [49]

    Tenney, J

    I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for nlp models, 2020. URL https://arxiv.org/abs/2008.05122

  25. [50]

    Touvron, T

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023

  26. [53]

    grok-1, 2024

    xAI. grok-1, 2024. URL https://github.com/xai-org/grok-1

  27. [54]

    Xla: Optimizing compiler for tensorflow, 2019

    XLA. Xla: Optimizing compiler for tensorflow, 2019. URL https://www.tensorflow.org/xla

  28. [56]

    J. Yang, A. Prabhakar, K. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback, 2023. URL https://arxiv.org/abs/2306.14898

  29. [59]

    Neural Combinatorial Optimization with Reinforcement Learning

    I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, 2016. URL https://arxiv.org/abs/1611.09940

  30. [60]

    Concrete Problems in AI Safety

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016

  31. [61]

    Quantifying Memorization Across Neural Language Models

    Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  32. [62]

    Scalable Extraction of Training Data from (Production) Language Models

    M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023

  33. [63]

    Extracting Training Data from Large Language Models

    N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2021

  34. [64]

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

    Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022

  35. [65]

    Madlad-400: A Multilingual and Document-Level Large Audited Dataset

    Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023

  36. [66]

    Defining and Characterizing Reward Gaming

    Defining and characterizing reward gaming. In NeurIPS, 2022

  37. [67]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  38. [68]

    Scaling Laws for Reward Model Overoptimization

    Scaling laws for reward model overoptimization, 2022

  39. [69]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016

  40. [70]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  41. [72]

    Training Compute-Optimal Large Language Models

    Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  42. [73]

    Mastering the Game of Go with Deep Neural Networks and Tree Search

    D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016

  43. [74]

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

  44. [75]

    Gemini: A Family of Highly Capable Multimodal Models, 2023

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  45. [76]

    Piqa: Reasoning about physical commonsense in natural language

    Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, 2019. URL https://arxiv.org/abs/1911.11641

  46. [77]

    SocialIQA: Commonsense Reasoning about Social Interactions

    M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. SocialIQA: Commonsense reasoning about social interactions. CoRR, 2019. URL https://arxiv.org/abs/1904.09728

  47. [78]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, 2019. URL https://arxiv.org/abs/1905.10044

  48. [79]

    Natural Questions: A Benchmark for Question Answering Research

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019

  49. [80]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, 2020. URL https://arxiv.org/abs/2009.03300

  50. [81]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, 2021. URL https://arxiv.org/abs/2108.07732

  51. [82]

    Language Models are Unsupervised Multitask Learners

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  52. [83]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  53. [84]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, et al. Evaluating large language models trained on code. CoRR, 2021. URL https://arxiv.org/abs/2107.03374

  54. [85]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, 2021. URL https://arxiv.org/abs/2110.14168

  55. [86]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, 2019. URL https://arxiv.org/abs/1907.10641

  56. [87]

    The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context

    D. Paperno, G. Kruszewski, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, 2016. URL https://arxiv.org/abs/1606.06031

  57. [88]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, 2017. URL https://arxiv.org/abs/1705.03551

  58. [89]

    Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023

    Llama 2: Open foundation and fine-tuned chat models, 2023

  59. [90]

    LLaMA: Open and Efficient Foundation Language Models, 2023

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models, 2023

  60. [91]

    Mistral 7B, 2023

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B, 2023

  61. [92]

    The Falcon Series of Open Language Models, 2023

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The Falcon series of open language models, 2023

  62. [93]

    Textbooks Are All You Need II: phi-1.5 technical report

    Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  63. [94]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  64. [95]

    Sequence to Sequence Learning with Neural Networks

    I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, 2014. URL https://arxiv.org/abs/1409.3215

  65. [96]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, 2017. URL https://arxiv.org/abs/1706.03762

  66. [97]

    Deep Learning

    Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015

  67. [98]

    Pathways: Asynchronous Distributed Dataflow for ML, 2022

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ML, 2022

  68. [99]

    Scaling Up Models and Data with t5x and seqio

    A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1--8, 2023

  69. [100]

    XLA: Optimizing Compiler for TensorFlow, 2019

    XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla

  70. [101]

    How Our Principles Helped Define AlphaFold’s Release, 2022

    How our principles helped define AlphaFold’s release, 2022

  71. [102]

    Large Scale Distributed Deep Networks

    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012

  72. [103]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013

  73. [104]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018. URL https://arxiv.org/abs/1810.04805

  74. [105]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, 2019. URL https://arxiv.org/abs/1910.10683

  75. [106]

    Scaling Up Models and Data with t5x and seqio

    A. Roberts, H. W. Chung, A. Levskaya, G. Mishra, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio, 2022

  76. [107]

    Fast Transformer Decoding: One Write-Head is All You Need

    N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, 2019. URL https://arxiv.org/abs/1911.02150

  77. [108]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, 2021. URL https://arxiv.org/abs/2104.09864

  78. [109]

    ZeRO-Offload: Democratizing Billion-Scale Model Training

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551--564, 2021

  79. [110]

    GLU Variants Improve Transformer

    N. Shazeer. GLU variants improve transformer. CoRR, 2020. URL https://arxiv.org/abs/2002.05202

  80. [111]

    Root Mean Square Layer Normalization

    B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, 2019. URL https://arxiv.org/abs/1910.07467

Showing first 80 references.