Gemma 3 Technical Report
Abstract
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
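The KV-cache saving the abstract attributes to a higher local-to-global attention ratio can be illustrated with back-of-envelope arithmetic. The sketch below is not the paper's exact configuration: the layer count, head counts, 1024-token window, and 6-layer global interval are illustrative assumptions chosen only to show why caching a short sliding window on most layers shrinks memory at long context.

```python
def kv_cache_bytes(n_layers, global_every, context, window,
                   n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Total KV-cache size: keys + values for every cached position per layer."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value  # K and V
    total = 0
    for layer in range(n_layers):
        # Every `global_every`-th layer attends globally and must cache the
        # full context; the remaining (local) layers use sliding-window
        # attention and only cache the last `window` tokens.
        span = context if layer % global_every == 0 else min(window, context)
        total += span * per_token
    return total

ctx = 128 * 1024
all_global = kv_cache_bytes(n_layers=48, global_every=1, context=ctx, window=ctx)
mixed = kv_cache_bytes(n_layers=48, global_every=6, context=ctx, window=1024)
print(f"all-global: {all_global / 2**30:.1f} GiB")
print(f"5:1 local/global, 1024-token window: {mixed / 2**30:.2f} GiB")
```

With these assumed numbers the mixed layout caches roughly a sixth of the all-global memory, which is the qualitative effect the abstract describes: global layers dominate the cache, so making them rare and keeping local spans short bounds growth with context length.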
Forward citations
Cited by 60 Pith papers
-
Architecture Determines Observability of Transformers
Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
-
Lost in Translation: Do LVLM Judges Generalize Across Languages?
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
-
SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.
-
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
-
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
-
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
-
GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design
GenCircuit-RL uses hierarchical verification rewards and curriculum learning in RL to generate correct genetic circuit code in SBOL, improving functional task success by 14-16 points and generalizing to novel biologic...
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...
-
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
-
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs
LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.
-
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
-
EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
SalesSim benchmarks MLLMs as retail user simulators, finds gaps in persona adherence and over-persuasion, and introduces UserGRPO RL to raise decision alignment by 13.8%.
-
LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
A secondary warden LLM halves the success rate of hidden-goal adversarial LLMs in steering user decisions while causing only minor interference with genuine interactions.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
-
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of either static method, with an efficient LoRA-only variant outperforming prior adaptive approaches.
-
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench is a new benchmark for evaluating cross-cultural knowledge insertion in MLLMs, paired with the MCKI baseline method, showing current approaches fail to balance adaptation and preservation.
-
CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
CrossCult-KIBench provides 9,800 test cases for cross-cultural knowledge insertion in MLLMs and shows that existing methods cannot reliably adapt to one culture while preserving behavior in others.
-
TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity
TableVista benchmark finds foundation models maintain performance across visual styles but degrade sharply on complex table structures and vision-only settings.
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs
PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
-
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Residual connections align cross-layer gradients while symmetry-breaking activations prevent rotational drift, causing principal singular vectors of adjacent layers to align.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
-
ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text
ScribbleEdit is a synthetic dataset combining scribbles and text for training image editing models that produce spatially aligned and semantically consistent results.
-
LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations
LIMSSR reformulates incomplete multimodal learning as LLM-driven sequence-to-score reasoning with prompt-guided imputation and mask-aware aggregation, outperforming baselines on action quality assessment without compl...
-
Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
Prompt reformulations induce high variance in first-token safety probabilities from zero-shot VLMs, and a training-free mean ensemble over prompt families improves NLL on all tested pairs and ECE on most relative to s...
-
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
-
EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
EmoTransCap creates the first large-scale dataset for discourse-level emotion transitions in speech, a multi-task recognition model, LLM-based annotations, and a controllable emotional speech synthesis system.
-
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation
MetaGAI is a new large-scale benchmark for automated model and data card generation, constructed via semantic triangulation and a multi-agent pipeline with human-in-the-loop verification.
-
Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection
Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
Can Multimodal Large Language Models Truly Understand Small Objects?
Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
-
To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.
-
Subject-level Inference for Realistic Text Anonymization Evaluation
SPIA benchmark reveals that subject-level inference protection falls to as low as 33% even after masking over 90% of PII spans, with non-target subjects remaining highly exposed under target-focused anonymization.
-
DistortBench: Benchmarking Vision Language Models on Image Distortion Identification
Vision-language models achieve at most 61.9% accuracy on identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.
-
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cos...
-
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.
-
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.