pith · machine review for the scientific record

arxiv: 2408.00118 · v3 · submitted 2024-07-31 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Gemma 2: Improving Open Language Models at a Practical Size

Abe Friesen, Alanna Walton, Alek Andreev, Alexandre Ramé, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Anand Rao, Anca Dragan, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Anton Tsitsulin, Armand Joulin, Behnam Neyshabur, Ben Bastian, Bilal Piot, Bobak Shahriari, Bo Wu, Brandon Royal, Cassidy Hardin, Charlie Chen, Charline Le Lan, Chintu Kumar, Chris Perry, Christopher A. Choquette-Choo, Chris Welty, Clement Farabet, Danila Sinopalnikov, David Weinberger, Demis Hassabis, Dimple Vijaykumar, Dominika Rogozińska, D. Sculley, Dustin Herbison, Elena Buchatskaya, Eli Collins, Elisa Bandy, Emma Wang, Erica Moreira, Eric Noland, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Gemma Team: Morgane Riviere, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jean-bastien Grill, Jeanine Banks, Jeff Dean, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joelle Barral, Johan Ferret, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Kathleen Kenealy, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Koray Kavukcuoglu, Lars Lowe Sjoesund, Laurent Sifre, Lauren Usui, Lena Heuermann, Léonard Hussenot, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Ludovic Peran, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mateo Wirth, Matt Davidow, Matthew Rahtz, Matthew Watson, Matt Hoffman, Matt Miller, Mat Velloso, Meg Risdal, Mehran Kazemi, Michael Moynihan, Michelle Casbon, Ming Zhang, Minh Giang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nikola Momchev, Nilay Chauhan, Nino Vieillard, Noah Fiedel, Olivier Bachem, Oriol Vinyals, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Peter Liu, Petko Georgiev, Phil Culliton, Phoebe Kirk, Pier Giuseppe Sessa, Piotr Stanczyk, Pouya Tafti, Pradeep Kuppala, Raia Hadsell, Ramona Comanescu, Ramona Merhej, Ravin Kumar, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Robert Dadashi, Ryan Mullins, Sabela Ramos, Samaneh Saadat, Sammy Jerome, Sarah Cogan, Sarah Perrin, Sara Mc Carthy, Sebastian Borgeaud, Sebastian Krause, Sébastien M. R. Arnold, Sertan Girgin, Shantanu Thakoor, Shengyang Dai, Shreya Pathak, Shruti Garg, Shruti Sheth, Slav Petrov, Sue Ronstrom, Surya Bhupatiraju, Susan Chan, Thomas Mesnard, Timothy Jordan, Ting Yu, Tomas Kocisky, Tom Eccles, Tom Hennigan, Tris Warkentin, Tulsee Doshi, Victor Cotruta, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Zoubin Ghahramani

Pith reviewed 2026-05-10 12:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords open language models · Gemma 2 · knowledge distillation · local-global attention · group-query attention · transformer architecture · model scaling · performance benchmarks

The pith

Gemma 2 models achieve leading performance at their sizes through interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gemma 2 as an updated family of open language models with 2 billion to 27 billion parameters. It incorporates interleaving of local and global attention layers together with group-query attention in the Transformer backbone, while training the 2B and 9B versions via knowledge distillation rather than standard next-token prediction. These changes produce models that lead their size class on benchmarks and remain competitive with models two to three times larger. A reader would care because the work demonstrates concrete ways to extract more capability from models that fit on everyday hardware and can be released openly.
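The distillation objective used for the smaller models can be illustrated with a toy example: instead of minimizing cross-entropy against a one-hot next-token target, the student minimizes cross-entropy against the teacher's full output distribution. A minimal plain-Python sketch; the four-token vocabulary and all probabilities are invented for illustration, not taken from the paper.

```python
import math

def cross_entropy(target_probs, student_probs):
    """Cross-entropy H(p, q) = -sum_i p_i * log q_i over a vocabulary."""
    return -sum(p * math.log(q) for p, q in zip(target_probs, student_probs) if p > 0)

# Toy 4-token vocabulary; the observed next token is index 2.
one_hot = [0.0, 0.0, 1.0, 0.0]       # standard next-token target
teacher = [0.05, 0.15, 0.70, 0.10]   # teacher's soft distribution (hypothetical)
student = [0.10, 0.20, 0.60, 0.10]   # student's predicted distribution (hypothetical)

nll_loss = cross_entropy(one_hot, student)  # plain next-token prediction loss
kd_loss = cross_entropy(teacher, student)   # distillation: match the teacher's whole distribution
```

The soft target carries a gradient signal on every vocabulary entry, not just the observed token, which is the usual intuition for why distillation helps small models.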

Core claim

The authors establish that applying interleaved local-global attention and group-query attention across the model family, plus knowledge distillation for the 2B and 9B models, yields the best performance at each size and makes the models competitive alternatives to systems that are two to three times larger.

What carries the argument

The central mechanisms are the interleaving of local and global attention patterns within the Transformer layers combined with group-query attention, along with knowledge distillation applied specifically to the 2 billion and 9 billion parameter models.
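Both mechanisms reduce to simple bookkeeping that can be sketched directly: alternating which layers use windowed versus full attention, and mapping many query heads onto fewer shared key/value heads. A minimal illustration; the layer count, alternation pattern, and head counts below are assumptions for the sketch, not Gemma 2's actual configuration.

```python
def layer_attention_pattern(num_layers):
    """Interleave local (sliding-window) and global (full) attention per layer.
    The even/odd assignment here is an assumed pattern for illustration."""
    return ["local" if i % 2 == 0 else "global" for i in range(num_layers)]

def kv_head_mapping(num_query_heads, num_kv_heads):
    """Group-query attention: query heads are partitioned into groups that share
    one KV head each, shrinking the KV cache by num_query_heads / num_kv_heads."""
    assert num_query_heads % num_kv_heads == 0
    group_size = num_query_heads // num_kv_heads
    # map each query head index to the KV head its group shares
    return [q // group_size for q in range(num_query_heads)]

pattern = layer_attention_pattern(8)                          # e.g. local, global, local, ...
mapping = kv_head_mapping(num_query_heads=16, num_kv_heads=4)  # 4 query heads per KV head
```

Local layers bound attention cost by the window size, global layers preserve long-range access, and the KV mapping is what cuts inference-time cache memory.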

If this is right

  • Open models at practical sizes can now substitute for much larger ones in many applications.
  • Hardware with modest memory can host capable language models without major quality loss.
  • Releasing the full range from 2B to 27B parameters widens access to high-performing open systems.
  • The same set of changes can be tested on future model scales to check if the efficiency pattern holds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach may encourage other developers to prioritize attention-pattern changes over simply adding parameters when resources are constrained.
  • Wider adoption could shift industry focus toward measuring performance per parameter rather than raw scale alone.
  • If the gains replicate across different training runs, they would support using these modifications as a standard baseline for new open models.

Load-bearing premise

The reported gains in performance come from the listed architectural modifications and the switch to distillation rather than from unreported differences in training data volume, compute budget, or evaluation setup.

What would settle it

A controlled experiment that trains identical model sizes with the same data and compute but removes the local-global interleaving and group-query attention would show whether the performance edge disappears on the same benchmarks.
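The controlled experiment described above amounts to a small factorial grid over the three modifications, holding size, data, and compute fixed in every arm. A sketch of how that grid could be enumerated; the flag names are hypothetical labels, not configuration keys from the paper.

```python
from itertools import product

# The three modifications the review says should be toggled independently.
MODIFICATIONS = ["interleaved_local_global", "group_query_attention", "knowledge_distillation"]

def ablation_arms():
    """Enumerate every on/off combination: 2^3 = 8 arms, from the all-off
    baseline to the full recipe, each trained with identical data and compute."""
    return [dict(zip(MODIFICATIONS, flags))
            for flags in product([False, True], repeat=len(MODIFICATIONS))]

arms = ablation_arms()  # arms[0] is the baseline; arms[-1] is the full recipe
```

Comparing each single-toggle arm against the baseline on the same benchmarks would isolate the contribution the referee report says is currently unsecured.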

read the original abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Gemma 2 family of open language models (2B, 9B, and 27B parameters). It applies known Transformer modifications including interleaving local-global attention and group-query attention, and trains the 2B and 9B models via knowledge distillation rather than next-token prediction. The central claim is that the resulting models achieve the best performance for their size and remain competitive with models 2-3 times larger; all models are released openly.

Significance. If the benchmark results are robust, the work supplies practically useful open models that advance the performance frontier at smaller scales, with the public release of weights enabling reproducibility and downstream research. This is a concrete contribution to accessible LLM development.

major comments (2)
  1. [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.
  2. [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.
minor comments (1)
  1. Ensure all benchmark tables include the exact evaluation protocols, number of runs, and any variance measures so that comparisons to 2–3× larger models can be reproduced.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for your review and the constructive feedback on our Gemma 2 manuscript. We address the major comments point by point below, clarifying the scope of our contributions while noting where revisions can strengthen the presentation.

read point-by-point responses
  1. Referee: [Sections 2–3] The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.

    Authors: We agree that the absence of component-wise ablations with fixed data, tokens, and compute makes it difficult to isolate the contribution of each individual change. The manuscript presents the Gemma 2 models as a practical integration of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants), with the central contribution being the resulting performance at these scales and the public release of the weights. We did not perform the requested ablations, as they fall outside the primary goal of delivering and evaluating the final models. In revision we will add explicit language in Sections 2–3 stating that performance gains reflect the combined system and that controlled ablations remain an avenue for future work. revision: partial

  2. Referee: [Results section] Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.

    Authors: We acknowledge that qualitative descriptions alone leave open the possibility that data differences contribute to the observed gains. Gemma 2 uses an updated mixture that retains the core web, code, and math sources from Gemma 1 while increasing the proportion of high-quality mathematical and code data. Exact token counts and source proportions cannot be released for proprietary and competitive reasons. In the revised manuscript we will expand the data description in the Results section to include a qualitative comparison with the Gemma 1 mixture and to note that the architectural and distillation choices were applied on top of this updated data regime. revision: partial

standing simulated objections not resolved
  • Exact token counts, source proportions, and quantitative comparison tables for the pretraining data mixture, which cannot be disclosed due to proprietary constraints.

Circularity Check

0 steps flagged

No derivation chain present; empirical model release

full rationale

The paper introduces Gemma 2 models by describing the application of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation) and reports benchmark performance. No equations, predictions, or first-principles derivations are claimed or present in the provided text. All cited methods are external (Beltagy et al., Ainslie et al., Hinton et al.), and results are measured against independent benchmarks with models released openly. The work contains no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard Transformer assumptions plus the effectiveness of the cited modifications; no new entities are postulated and the free parameters are the usual training hyperparameters and data choices that are not enumerated in the abstract.

free parameters (2)
  • model scale choices
    Selection of 2B, 9B, and 27B parameter counts as practical sizes
  • training hyperparameters
    Learning rates, batch sizes, and distillation temperatures not specified in abstract
axioms (2)
  • domain assumption Standard Transformer attention and feed-forward blocks remain effective when modified with local-global interleaving and group-query attention
    Invoked by citing Beltagy et al. and Ainslie et al. without re-derivation
  • domain assumption Knowledge distillation improves smaller models over next-token prediction alone
    Cited from Hinton et al. and applied to 2B/9B variants

pith-pipeline@v0.9.0 · 6321 in / 1432 out tokens · 29269 ms · 2026-05-10T12:06:11.309894+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Masked Generative Transformer Is What You Need for Image Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

  2. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  3. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  5. SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents

    cs.CR 2026-04 unverdicted novelty 8.0

    The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...

  6. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  7. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  8. Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization

    math.OC 2026-05 conditional novelty 7.0

    Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.

  9. Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.

  10. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  11. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  12. Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

    cs.LG 2026-05 unverdicted novelty 7.0

    Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.

  13. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  14. Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

    cs.LG 2026-05 unverdicted novelty 7.0

    GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.

  15. Implicit Representations of Grammaticality in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

  16. FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

    cs.AI 2026-05 conditional novelty 7.0

    FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.

  17. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  18. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing ...

  19. Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

    cs.SE 2026-05 unverdicted novelty 7.0

    Themis builds a multilingual benchmark and large preference dataset to train code reward models that score outputs on multiple criteria like correctness, efficiency, and style.

  20. E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

    cs.CR 2026-05 unverdicted novelty 7.0

    E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...

  21. Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.

  22. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  23. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  24. Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs exhibit a clear preference for Japanese culture when answering open cultural questions, with this bias emerging after supervised fine-tuning rather than during pre-training.

  25. How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

  26. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  27. MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.

  28. LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.

  29. Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

    cs.CL 2026-04 unverdicted novelty 7.0

    PIE prunes CLT features first via FAP and FAP-Synergy to match baseline circuit fidelity at lower feature budgets on IOI and Doc-String tasks, reducing interpretation costs.

  30. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

  31. Response-Aware User Memory Selection for LLM Personalization

    cs.AI 2026-04 unverdicted novelty 7.0

    RUMS selects LLM user memory via mutual information with model outputs to reduce response uncertainty, outperforming similarity-based methods in human alignment and response quality with up to 95% lower cost.

  32. Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering

    cs.IR 2026-04 unverdicted novelty 7.0

    CHR improves medical question answering retrieval by explicitly promoting evidence aligned with a correct hypothesis while penalizing content aligned with a plausible incorrect alternative.

  33. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  34. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  35. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  36. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  37. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  38. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  39. ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

    cs.CL 2026-05 conditional novelty 6.0

    ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to transla...

  40. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  41. Domain Restriction via Multi SAE Layer Transitions

    cs.AI 2026-05 unverdicted novelty 6.0

    Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.

  42. From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 6.0

    MedTPE compresses EHR token sequences by up to 31% via merging common medical token pairs, reducing LLM inference latency 34-63% while maintaining or improving performance on mortality and phenotyping tasks.

  43. Causal Bias Detection in Generative Artifical Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.

  44. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  45. Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Hi-GaTA is a gated temporal pyramid adapter that aggregates multi-scale video features via text-conditioned cross-attention and gated fusion to enable LLM-based surgical report generation, backed by a new 214-video be...

  46. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  47. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  48. SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

  49. Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

    cs.CL 2026-05 unverdicted novelty 6.0

    SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

  50. Towards Generation-Efficient Uncertainty Estimation in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.

  51. Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    stat.ML 2026-05 unverdicted novelty 6.0

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...

  52. CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

    cs.LG 2026-05 unverdicted novelty 6.0

    CuBridge adapts expert CUDA attention kernels via LLM-driven lift-transfer-lower to produce correct, high-performance implementations for new variants across GPUs.

  53. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  54. Conceptors for Semantic Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in mu...
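The soft projection matrices mentioned here follow the classic conceptor construction C = R(R + α⁻²I)⁻¹ from a correlation matrix R of activations. The sketch below is that textbook form under illustrative names and an arbitrary aperture α; the paper's exact recipe may differ.

```python
import numpy as np

def conceptor(X, alpha=10.0):
    """Soft projection matrix C = R (R + alpha^-2 I)^-1 from activation
    samples X (n x d). Eigenvalues of C lie in (0, 1), so C scales each
    principal direction by how strongly the concept occupies it, rather
    than hard-projecting onto a single steering vector."""
    n, d = X.shape
    R = (X.T @ X) / n                            # empirical correlation matrix
    return R @ np.linalg.inv(R + alpha**-2 * np.eye(d))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                    # activations for one concept
C = conceptor(X)
steered = C @ rng.normal(size=8)                 # softly project a hidden state
```

Because conceptors are matrices rather than single vectors, they compose (Boolean-style AND/OR operations exist for them), which is one reading of the "multidimensional, compositional" claim in the summary.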

  55. When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

    cs.CL 2026-05 conditional novelty 6.0

    AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.

  56. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  57. DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

    cs.PL 2026-05 unverdicted novelty 6.0

    DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...

  58. Minimizing Collateral Damage in Activation Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
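The constrained-optimization view in this summary has a simple closed form: minimizing the anisotropic damage δᵀMδ (M the empirical second-moment matrix) subject to a fixed shift dᵀδ along the concept direction gives δ ∝ M⁻¹d via a Lagrange multiplier. The sketch below illustrates that calculation on toy data; the names and setup are assumptions, not the paper's code.

```python
import numpy as np

def min_damage_perturbation(H, d, strength=1.0):
    """Minimize delta^T M delta subject to d^T delta = strength, where M is
    the empirical second-moment matrix of activations H (n x d_model).
    Closed form: delta = strength * M^-1 d / (d^T M^-1 d). Under isotropy
    (M = I) this reduces to the usual scaled steering vector."""
    M = (H.T @ H) / len(H)                     # empirical second-moment matrix
    Minv_d = np.linalg.solve(M, d)
    return strength * Minv_d / (d @ Minv_d)    # satisfies d @ delta = strength

rng = np.random.default_rng(2)
H = rng.normal(size=(500, 6)) * np.array([3.0, 1, 1, 1, 1, 0.3])  # anisotropic
d = np.ones(6)                                 # desired steering direction
delta = min_damage_perturbation(H, d)
```

By construction `delta` achieves the same shift along `d` as the naive isotropic choice `d / (d @ d)` but with no greater damage δᵀMδ, since it routes the perturbation through low-variance directions.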

  59. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.

  60. MUDY: Multi-Granular Dynamic Candidate Contextualization for Unsupervised Keyphrase Extraction

    cs.IR 2026-05 unverdicted novelty 6.0

    MUDY improves unsupervised keyphrase extraction by combining prompt-based scoring with candidate-aware weighting and self-attention-based multi-granular scoring to capture both local and global contextual salience, ou...

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 120 Pith papers · 27 internal anchors

  1. [2]

    Agarwal, N

    R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  2. [3]

    Llama 3 model card, 2024

    AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

  3. [5]

    Almazrouei, H

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models, 2023

  4. [8]

    Barham, A

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ml, 2022

  5. [15]

    Chiang, L

    W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

  6. [18]

    Gemini: A family of highly capable multimodal models, 2023

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  7. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

    Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024

  8. [20]

    Gemma: Open models based on gemini research and technology, 2024

    Gemma Team. Gemma: Open models based on gemini research and technology, 2024

  9. [21]

    Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024

  10. [26]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023

  11. [27]

    Kahng, I

    M. Kahng, I. Tenney, M. Pushkarna, M. X. Liu, J. Wexler, E. Reif, K. Kallarackal, M. Chang, M. Terry, and L. Dixon. Llm comparator: Visual analytics for side-by-side evaluation of large language models, 2024. URL https://arxiv.org/abs/2402.10524

  12. [28]

    Evaluating language-model agents on realistic autonomous tasks

    M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, and P. Christiano. Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671

  13. [32]

    Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024. URL https://arxiv.org/abs/2401.03315

  14. [34]

    Personal Communication, 2024

    Macknight, Aung, and Gomes. Personal Communication, 2024

  15. [35]

    Towards agile text classifiers for everyone, 2023

    M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate, N. Thain, A. Yuan, T. Bolukbasi, and L. Dixon. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541

  16. [37]

    Phuong, M

    M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024

  17. [38]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  18. [40]

    A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024

  19. [41]

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551--564, 2021

  20. [42]

    Roberts, H

    A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1--8, 2023

  21. [45]

    Shevlane, S

    T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324

  22. [47]

    Suzgun, N

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022

  23. [48]

    Qwen Team. Introducing Qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/

  24. [49]

    Tenney, J

    I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for nlp models, 2020. URL https://arxiv.org/abs/2008.05122

  25. [50]

    Touvron, T

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023

  26. [53]

    grok-1, 2024

    xAI. grok-1, 2024. URL https://github.com/xai-org/grok-1

  27. [54]

    Xla: Optimizing compiler for tensorflow, 2019

    XLA. Xla: Optimizing compiler for tensorflow, 2019. URL https://www.tensorflow.org/xla

  28. [56]

    J. Yang, A. Prabhakar, K. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback, 2023. URL https://arxiv.org/abs/2306.14898

  29. [59]

    Neural Combinatorial Optimization with Reinforcement Learning

    I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, 2016. URL https://arxiv.org/abs/1611.09940

  30. [60]

    Concrete Problems in AI Safety

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016

  31. [61]

    Quantifying Memorization Across Neural Language Models

    Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  32. [62]

    Scalable Extraction of Training Data from (Production) Language Models

    M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023

  33. [63]

    Extracting Training Data from Large Language Models

    N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2021

  34. [64]

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

    Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022

  35. [65]

    Madlad-400: A Multilingual and Document-Level Large Audited Dataset

    Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023

  36. [66]

    Defining and Characterizing Reward Gaming

    Defining and characterizing reward gaming. In NeurIPS, 2022

  37. [67]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  38. [68]

    Scaling Laws for Reward Model Overoptimization

    Scaling laws for reward model overoptimization, 2022

  39. [69]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016

  40. [70]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  41. [72]

    Training Compute-Optimal Large Language Models

    Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  42. [73]

    Mastering the Game of Go with Deep Neural Networks and Tree Search

    D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016

  43. [74]

    TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

    TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

  44. [75]

    Gemini: A Family of Highly Capable Multimodal Models, 2023

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  45. [76]

    Piqa: Reasoning about physical commonsense in natural language

    Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, 2019. URL https://arxiv.org/abs/1911.11641

  46. [77]

    SocialIQA: Commonsense Reasoning about Social Interactions

    M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. SocialIQA: Commonsense reasoning about social interactions. CoRR, 2019. URL https://arxiv.org/abs/1904.09728

  47. [78]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, 2019. URL https://arxiv.org/abs/1905.10044

  48. [79]

    Natural Questions: A Benchmark for Question Answering Research

    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019

  49. [80]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, 2020. URL https://arxiv.org/abs/2009.03300

  50. [81]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, 2021. URL https://arxiv.org/abs/2108.07732

  51. [82]

    Language Models are Unsupervised Multitask Learners

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019

  52. [83]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  53. [84]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, et al. Evaluating large language models trained on code. CoRR, 2021. URL https://arxiv.org/abs/2107.03374

  54. [85]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, 2021. URL https://arxiv.org/abs/2110.14168

  55. [86]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, 2019. URL https://arxiv.org/abs/1907.10641

  56. [87]

    The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context

    D. Paperno, G. Kruszewski, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. CoRR, 2016. URL https://arxiv.org/abs/1606.06031

  57. [88]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, 2017. URL https://arxiv.org/abs/1705.03551

  58. [89]

    Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023

    Llama 2: Open foundation and fine-tuned chat models, 2023

  59. [90]

    LLaMA: Open and Efficient Foundation Language Models, 2023

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models, 2023

  60. [91]

    Mistral 7B, 2023

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7B, 2023

  61. [92]

    The Falcon Series of Open Language Models, 2023

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The Falcon series of open language models, 2023

  62. [93]

    Textbooks Are All You Need II: phi-1.5 technical report

    Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  63. [94]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  64. [95]

    Sequence to Sequence Learning with Neural Networks

    I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, 2014. URL https://arxiv.org/abs/1409.3215

  65. [96]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, 2017. URL https://arxiv.org/abs/1706.03762

  66. [97]

    Deep Learning

    Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015

  67. [98]

    Pathways: Asynchronous Distributed Dataflow for ML, 2022

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ML, 2022

  68. [99]

    Scaling Up Models and Data with t5x and seqio

    A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1--8, 2023

  69. [100]

    XLA: Optimizing Compiler for TensorFlow, 2019

    XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla

  70. [101]

    How Our Principles Helped Define AlphaFold’s Release, 2022

    How our principles helped define AlphaFold’s release, 2022

  71. [102]

    Large Scale Distributed Deep Networks

    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012

  72. [103]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013

  73. [104]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018. URL https://arxiv.org/abs/1810.04805

  74. [105]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, 2019. URL https://arxiv.org/abs/1910.10683

  75. [106]

    Scaling Up Models and Data with t5x and seqio

    A. Roberts, H. W. Chung, A. Levskaya, G. Mishra, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio, 2022

  76. [107]

    Fast Transformer Decoding: One Write-Head is All You Need

    N. Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, 2019. URL https://arxiv.org/abs/1911.02150

  77. [108]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. CoRR, 2021. URL https://arxiv.org/abs/2104.09864

  78. [109]

    ZeRO-Offload: Democratizing Billion-Scale Model Training

    J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551--564, 2021

  79. [110]

    GLU Variants Improve Transformer

    N. Shazeer. GLU variants improve transformer. CoRR, 2020. URL https://arxiv.org/abs/2002.05202

  80. [111]

    Root Mean Square Layer Normalization

    B. Zhang and R. Sennrich. Root mean square layer normalization. CoRR, 2019. URL https://arxiv.org/abs/1910.07467

Showing first 80 references.