Recognition: no theorem link
Gemma 2: Improving Open Language Models at a Practical Size
Pith reviewed 2026-05-10 12:06 UTC · model grok-4.3
The pith
Gemma 2 models achieve leading performance at their sizes through interleaving local-global attention, group-query attention, and knowledge distillation on the smaller variants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that applying interleaved local-global attention and group-query attention across the model family, plus knowledge distillation for the 2B and 9B models, yields the best performance at each size and makes the models competitive alternatives to systems two to three times larger.
What carries the argument
The central mechanisms are the interleaving of local and global attention patterns across the Transformer layers, combined with group-query attention, plus knowledge distillation applied specifically to the 2-billion- and 9-billion-parameter models.
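The interleaving pattern can be sketched as alternating attention masks across layers. This is an illustrative NumPy sketch, not Gemma 2's published configuration: the 1:1 alternation and the toy window size are placeholder assumptions.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal local mask: token i attends only to tokens in (i - window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

def global_causal_mask(seq_len: int) -> np.ndarray:
    """Standard causal mask: token i attends to all tokens j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def layer_mask(layer_idx: int, seq_len: int, window: int = 4) -> np.ndarray:
    """Alternate local (even layers) and global (odd layers) attention.

    Illustrative placeholder values, not the paper's exact layout.
    """
    if layer_idx % 2 == 0:
        return sliding_window_mask(seq_len, window)
    return global_causal_mask(seq_len)
```

A local layer never looks farther back than `window` tokens, so its key-value cache can be bounded, while the interleaved global layers preserve long-range information flow.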
If this is right
- Open models at practical sizes can now substitute for much larger ones in many applications.
- Hardware with modest memory can host capable language models without major quality loss.
- Releasing the full range from 2B to 27B parameters widens access to high-performing open systems.
- The same set of changes can be tested on future model scales to check if the efficiency pattern holds.
Where Pith is reading between the lines
- The approach may encourage other developers to prioritize attention-pattern changes over simply adding parameters when resources are constrained.
- Wider adoption could shift industry focus toward measuring performance per parameter rather than raw scale alone.
- If the gains replicate across different training runs, they would support using these modifications as a standard baseline for new open models.
Load-bearing premise
The reported gains in performance come from the listed architectural modifications and the switch to distillation rather than from unreported differences in training data volume, compute budget, or evaluation setup.
What would settle it
A controlled experiment that trains identical model sizes with the same data and compute but removes the local-global interleaving and group-query attention would show whether the performance edge disappears on the same benchmarks.
Original abstract
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
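Of the techniques named in the abstract, group-query attention (Ainslie et al., 2023) is the easiest to make concrete: several query heads share a single key/value head, shrinking the KV cache. A minimal NumPy sketch; the head counts and shapes are illustrative assumptions, and causal masking is omitted for brevity.

```python
import numpy as np

def group_query_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Group-query attention sketch: query heads share KV heads in groups.

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    Real implementations fuse this loop and apply a causal mask.
    """
    num_q_heads, seq, d = q.shape
    num_kv_heads = k.shape[0]
    group = num_q_heads // num_kv_heads
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                      # map each query head to its shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # softmax over keys
        out[h] = w @ v[kv]
    return out
```

With 8 query heads and 2 KV heads, the KV cache is a quarter of the multi-head size while each query head still computes its own attention pattern.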
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Gemma 2 family of open language models (2B, 9B, and 27B parameters). It applies known Transformer modifications including interleaving local-global attention and group-query attention, and trains the 2B and 9B models via knowledge distillation rather than next-token prediction. The central claim is that the resulting models achieve the best performance for their size and remain competitive with models 2-3 times larger; all models are released openly.
Significance. If the benchmark results are robust, the work supplies practically useful open models that advance the performance frontier at smaller scales, with the public release of weights enabling reproducibility and downstream research. This is a concrete contribution to accessible LLM development.
Major comments (2)
- [Sections 2–3] Sections 2–3: The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.
- [Results section] Results section: Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.
Minor comments (1)
- Ensure all benchmark tables include the exact evaluation protocols, number of runs, and any variance measures so that comparisons to 2–3× larger models can be reproduced.
Simulated Author's Rebuttal
Thank you for your review and the constructive feedback on our Gemma 2 manuscript. We address the major comments point by point below, clarifying the scope of our contributions while noting where revisions can strengthen the presentation.
Point-by-point responses
-
Referee: [Sections 2–3] Sections 2–3: The architectural changes (interleaved local-global attention, group-query attention) and switch to knowledge distillation for the 2B/9B models are described at a high level, yet no ablation experiments are reported that hold data mixture, token count, and compute fixed while removing one modification at a time. This leaves the attribution of reported benchmark gains to the listed techniques unsecured, as the central performance claim could be driven by undisclosed differences in pretraining data or scale.
Authors: We agree that the absence of component-wise ablations with fixed data, tokens, and compute makes it difficult to isolate the contribution of each individual change. The manuscript presents the Gemma 2 models as a practical integration of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation for the smaller variants), with the central contribution being the resulting performance at these scales and the public release of the weights. We did not perform the requested ablations, as they fall outside the primary goal of delivering and evaluating the final models. In revision we will add explicit language in Sections 2–3 stating that performance gains reflect the combined system and that controlled ablations remain an avenue for future work. revision: partial
-
Referee: [Results section] Results section: Training data is characterized only qualitatively (web, code, math) with no token counts, source proportions, or direct comparison to the Gemma 1 mixture. Without these details or controlled ablations, it is impossible to isolate the contribution of the architectural and distillation choices from data effects, which routinely produce benchmark deltas of the reported magnitude.
Authors: We acknowledge that qualitative descriptions alone leave open the possibility that data differences contribute to the observed gains. Gemma 2 uses an updated mixture that retains the core web, code, and math sources from Gemma 1 while increasing the proportion of high-quality mathematical and code data. Exact token counts and source proportions cannot be released for proprietary and competitive reasons. In the revised manuscript we will expand the data description in the Results section to include a qualitative comparison with the Gemma 1 mixture and to note that the architectural and distillation choices were applied on top of this updated data regime. revision: partial
- Withheld from the revision: exact token counts, source proportions, and quantitative comparison tables for the pretraining data mixture, which cannot be disclosed due to proprietary constraints.
Circularity Check
No derivation chain present; empirical model release
Full rationale
The paper introduces Gemma 2 models by describing the application of established techniques (interleaved local-global attention, group-query attention, and knowledge distillation) and reports benchmark performance. No equations, predictions, or first-principles derivations are claimed or present in the provided text. All cited methods are external (Beltagy et al., Ainslie et al., Hinton et al.), and results are measured against independent benchmarks with models released openly. The work contains no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale choices
- training hyperparameters
axioms (2)
- Domain assumption: standard Transformer attention and feed-forward blocks remain effective when modified with local-global interleaving and group-query attention
- Domain assumption: knowledge distillation improves smaller models over next-token prediction alone
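The second assumption can be made concrete: distillation replaces the one-hot next-token target with the teacher's full output distribution. A minimal sketch of the per-position KL objective in the style of Hinton et al. (2015); the temperature handling and array shapes here are illustrative assumptions, not the paper's training recipe.

```python
import numpy as np

def softmax(logits: np.ndarray, temp: float = 1.0) -> np.ndarray:
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      temp: float = 1.0) -> float:
    """Mean KL(teacher || student) over positions: the student matches the
    teacher's soft next-token distribution instead of a one-hot target."""
    p = softmax(teacher_logits, temp)              # teacher's soft targets
    log_q = np.log(softmax(student_logits, temp))  # student log-probabilities
    kl = (p * (np.log(p) - log_q)).sum(axis=-1)    # KL per position
    return float(kl.mean())
```

Plain next-token prediction is the same expression with `p` replaced by a one-hot vector on the observed token, which is why the soft targets carry strictly more signal per position.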
Forward citations
Cited by 60 Pith papers
-
Masked Generative Transformer Is What You Need for Image Editing
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
-
Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
SecGoal: A Benchmark for Security Goal Extraction and Formalization from Protocol Documents
The paper presents SecGoal, the first expert-annotated benchmark for security goal extraction from protocol documents, and demonstrates that fine-tuned 7B/9B parameter models achieve over 80% F1 score, outperforming l...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
-
Uncovering Symmetry Transfer in Large Language Models via Layer-Peeled Optimization
Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.
-
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
-
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
-
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
-
Implicit Representations of Grammaticality in Language Models
Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.
-
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
FinSTaR reaches 78.9% accuracy on a new financial time series reasoning benchmark by applying Compute-in-CoT for deterministic assessments and Scenario-Aware CoT for stochastic predictions.
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing ...
-
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Themis builds a multilingual benchmark and large preference dataset to train code reward models that score outputs on multiple criteria like correctness, efficiency, and style.
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
-
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
-
Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs
LLMs exhibit a clear preference for Japanese culture when answering open cultural questions, with this bias emerging after supervised fine-tuning rather than during pre-training.
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.
-
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
-
Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
PIE prunes CLT features first via FAP and FAP-Synergy to match baseline circuit fidelity at lower feature budgets on IOI and Doc-String tasks, reducing interpretation costs.
-
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
-
Response-Aware User Memory Selection for LLM Personalization
RUMS selects LLM user memory via mutual information with model outputs to reduce response uncertainty, outperforming similarity-based methods in human alignment and response quality with up to 95% lower cost.
-
Ruling Out to Rule In: Contrastive Hypothesis Retrieval for Medical Question Answering
CHR improves medical question answering retrieval by explicitly promoting evidence aligned with a correct hypothesis while penalizing content aligned with a plausible incorrect alternative.
-
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to transla...
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...
-
Domain Restriction via Multi SAE Layer Transitions
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
-
From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
MedTPE compresses EHR token sequences by up to 31% via merging common medical token pairs, reducing LLM inference latency 34-63% while maintaining or improving performance on mortality and phenotyping tasks.
-
Causal Bias Detection in Generative Artificial Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
-
Leveraging RAG for Training-Free Alignment of LLMs
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
-
Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Hi-GaTA is a gated temporal pyramid adapter that aggregates multi-scale video features via text-conditioned cross-attention and gated fusion to enable LLM-based surgical report generation, backed by a new 214-video be...
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
-
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
-
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
-
Towards Generation-Efficient Uncertainty Estimation in Large Language Models
Uncertainty estimation for LLM hallucinations can be done effectively with partial generations or input-only predictors, reducing the need for full autoregressive sampling.
-
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
-
CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels
CuBridge adapts expert CUDA attention kernels via LLM-driven lift-transfer-lower to produce correct, high-performance implementations for new variants across GPUs.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
-
Conceptors for Semantic Steering
Conceptors as soft projection matrices from bipolar activations offer a multidimensional, compositional, and geometrically principled method for semantic steering in LLMs that outperforms single-vector baselines in mu...
-
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
DITRON introduces a hierarchical multi-level tiling compiler for distributed tensor programs that matches or exceeds expert CUDA libraries with 6-30% speedups and has been deployed to improve training MFU by over 10% ...
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
-
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
-
MUDY: Multi-Granular Dynamic Candidate Contextualization for Unsupervised Keyphrase Extraction
MUDY improves unsupervised keyphrase extraction by combining prompt-based scoring with candidate-aware weighting and self-attention-based multi-granular scoring to capture both local and global contextual salience, ou...
Reference graph
Works this paper leans on
-
[2]
R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[3]
AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
work page 2024
-
[5]
E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models, 2023
work page 2023
- [8]
- [15]
-
[18]
Gemini Team. Gemini: A family of highly capable multimodal models, 2023
work page 2023
-
[19]
Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024
work page 2024
-
[20]
Gemma Team. Gemma: Open models based on Gemini research and technology, 2024
work page 2024
-
[21]
Y. Gu, L. Dong, F. Wei, and M. Huang. Minillm: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[26]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023
work page 2023
- [27]
-
[28]
M. Kinniment, L. J. K. Sato, H. Du, B. Goodrich, M. Hasin, L. Chan, L. H. Miles, T. R. Lin, H. Wijk, J. Burget, A. Ho, E. Barnes, and P. Christiano. Evaluating language-model agents on realistic autonomous tasks, 2024. URL https://arxiv.org/abs/2312.11671
- [32]
-
[34]
Macknight, Aung, and Gomes. Personal Communication, 2024
work page 2024
-
[35]
M. Mozes, J. Hoffmann, K. Tomanek, M. Kouate, N. Thain, A. Yuan, T. Bolukbasi, and L. Dixon. Towards agile text classifiers for everyone, 2023. URL https://arxiv.org/abs/2302.06541
-
[37]
M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities
-
[38]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019
work page 2019
-
[40]
A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. Warp: On the benefits of weight averaged rewarded policies, 2024
work page 2024
-
[41]
J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021
work page 2021
-
[42]
A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023
work page 2023
-
[45]
T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023. URL https://arxiv.org/abs/2305.15324
- [47]
-
[48]
Q. Team. Introducing qwen1.5, February 2024. URL https://qwenlm.github.io/blog/qwen1.5/
work page 2024
-
[49]
I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for nlp models, 2020. URL https://arxiv.org/abs/2008.05122
-
[50]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
- [53]
-
[54]
XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla
work page 2019
- [56]
-
[59]
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. CoRR, 2016. URL https://arxiv.org/abs/1611.09940
work page 2016
-
[60]
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint, 2016
-
[61]
Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022
work page 2022
-
[62]
Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023
-
[63]
Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2021
-
[64]
Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022
-
[65]
arXiv preprint arXiv:2309.04662 , year=
Madlad-400: A multilingual and document-level large audited dataset , author=. arXiv preprint arXiv:2309.04662 , year=
- [66]
-
[67]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=
work page 2023
-
[68]
Scaling Laws for Reward Model Overoptimization , author=. 2022 , eprint=
work page 2022
-
[69]
A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
A baseline for detecting misclassified and out-of-distribution examples in neural networks , author=. arXiv preprint arXiv:1610.02136 , year=
work page internal anchor Pith review arXiv
-
[70]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Gqa: Training generalized multi-query transformer models from multi-head checkpoints , author=. arXiv preprint arXiv:2305.13245 , year=
work page internal anchor Pith review arXiv
-
[72]
Training Compute-Optimal Large Language Models
Training compute-optimal large language models , author=. arXiv preprint arXiv:2203.15556 , year=
work page internal anchor Pith review arXiv
-
[73]
Silver, David and Huang, Aja and Maddison, Chris J and Guez, Arthur and Sifre, Laurent and Van Den Driessche, George and Schrittwieser, Julian and Antonoglou, Ioannis and Panneershelvam, Veda and Lanctot, Marc and others , journal=. Mastering the game of. 2016 , publisher=
work page 2016
-
[74]
Proceedings of the 50th Annual International Symposium on Computer Architecture , pages=
Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings , author=. Proceedings of the 50th Annual International Symposium on Computer Architecture , pages=
-
[75]
Gemini: A Family of Highly Capable Multimodal Models , author=. 2023 , eprint=
work page 2023
-
[76]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1911.11641 , timestamp =
-
[77]
SocialIQA: Commonsense Reasoning about Social Interactions
Maarten Sap and Hannah Rashkin and Derek Chen and Ronan Le Bras and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1904.09728 , timestamp =
work page internal anchor Pith review arXiv 2019
-
[78]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark and Kenton Lee and Ming. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions , journal =. 2019 , url =. 1905.10044 , timestamp =
work page internal anchor Pith review arXiv 2019
-
[79]
Transactions of the Association for Computational Linguistics , author =
Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav. Natura...
-
[80]
Measuring Massive Multitask Language Understanding
Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt , title =. CoRR , volume =. 2020 , url =. 2009.03300 , timestamp =
work page internal anchor Pith review arXiv 2020
-
[81]
Program Synthesis with Large Language Models
Jacob Austin and Augustus Odena and Maxwell I. Nye and Maarten Bosma and Henryk Michalewski and David Dohan and Ellen Jiang and Carrie J. Cai and Michael Terry and Quoc V. Le and Charles Sutton , title =. CoRR , volume =. 2021 , url =. 2108.07732 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[82]
Language Models are Unsupervised Multitask Learners , author=
-
[83]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[84]
Evaluating Large Language Models Trained on Code
Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Pond. Evaluating Large Language Models Trained on Code , journal =. 2021 , url =. 2107.03374 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[85]
Training Verifiers to Solve Math Word Problems
Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Mark Chen and Heewoo Jun and Lukasz Kaiser and Matthias Plappert and Jerry Tworek and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman , title =. CoRR , volume =. 2021 , url =. 2110.14168 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[86]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi and Ronan Le Bras and Chandra Bhagavatula and Yejin Choi , title =. CoRR , volume =. 2019 , url =. 1907.10641 , timestamp =
work page internal anchor Pith review arXiv 2019
-
[87]
Denis Paperno and Germ. The. CoRR , volume =. 2016 , url =. 1606.06031 , timestamp =
work page Pith review arXiv 2016
-
[88]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer , title =. CoRR , volume =. 2017 , url =. 1705.03551 , timestamp =
work page internal anchor Pith review arXiv 2017
-
[89]
Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=
work page 2023
-
[90]
LLaMA: Open and Efficient Foundation Language Models , author=. 2023 , eprint=
work page 2023
- [91]
- [92]
-
[93]
Textbooks Are All You Need II: phi-1.5 technical report
Textbooks are all you need ii: phi-1.5 technical report , author=. arXiv preprint arXiv:2309.05463 , year=
work page internal anchor Pith review arXiv
-
[94]
Distilling the Knowledge in a Neural Network
Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[95]
Ke Tran, Arianna Bisazza, and Christof Monz
Ilya Sutskever and Oriol Vinyals and Quoc V. Le , title =. CoRR , volume =. 2014 , url =. 1409.3215 , timestamp =
-
[96]
Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , title =. CoRR , volume =. 2017 , url =. 1706.03762 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [97]
-
[98]
Pathways: Asynchronous Distributed Dataflow for ML , author=. 2022 , eprint=
work page 2022
-
[99]
Journal of Machine Learning Research , volume=
Scaling up models and data with t5x and seqio , author=. Journal of Machine Learning Research , volume=
- [100]
-
[101]
How our principles helped define AlphaFold’s release , author=. 2022 , publisher=
work page 2022
-
[102]
Large Scale Distributed Deep Networks , url =
Dean, Jeffrey and Corrado, Greg and Monga, Rajat and Chen, Kai and Devin, Matthieu and Mao, Mark and Ranzato, Marc aurelio and Senior, Andrew and Tucker, Paul and Yang, Ke and Le, Quoc and Ng, Andrew , booktitle =. Large Scale Distributed Deep Networks , url =
-
[103]
Efficient Estimation of Word Representations in Vector Space , booktitle =
Tom. Efficient Estimation of Word Representations in Vector Space , booktitle =. 2013 , url =
work page 2013
-
[104]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin and Ming. CoRR , volume =. 2018 , url =. 1810.04805 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[105]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. CoRR , volume =. 2019 , url =. 1910.10683 , timestamp =
work page internal anchor Pith review arXiv 2019
-
[106]
Adam Roberts and Hyung Won Chung and Anselm Levskaya and Gaurav Mishra and James Bradbury and Daniel Andor and Sharan Narang and Brian Lester and Colin Gaffney and Afroz Mohiuddin and Curtis Hawthorne and Aitor Lewkowycz and Alex Salcianu and Marc van Zee and Jacob Austin and Sebastian Goodman and Livio Baldini Soares and Haitang Hu and Sasha Tsvyashchenk...
-
[107]
Fast Transformer Decoding: One Write-Head is All You Need
Noam Shazeer , title =. CoRR , volume =. 2019 , url =. 1911.02150 , timestamp =
work page internal anchor Pith review arXiv 2019
-
[108]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu , title =. CoRR , volume =. 2021 , url =. 2104.09864 , timestamp =
work page internal anchor Pith review arXiv 2021
-
[109]
2021 USENIX Annual Technical Conference (USENIX ATC 21) , pages=
\ Zero-offload \ : Democratizing \ billion-scale \ model training , author=. 2021 USENIX Annual Technical Conference (USENIX ATC 21) , pages=
work page 2021
-
[110]
GLU Variants Improve Transformer
Noam Shazeer , title =. CoRR , volume =. 2020 , url =. 2002.05202 , timestamp =
work page internal anchor Pith review arXiv 2020
-
[111]
Biao Zhang and Rico Sennrich , title =. CoRR , volume =. 2019 , url =. 1910.07467 , timestamp =