Olmo 3

Team Olmo: Allyson Ettinger , Amanda Bertsch , Bailey Kuehl , David Graham , David Heineman , Dirk Groeneveld , Faeze Brahman , Finbarr Timbers , Hamish Ivison , Jacob Morrison , Jake Poznanski , Kyle Lo , Luca Soldaini , Matt Jordan , Mayee Chen , Michael Noukhovitch , Nathan Lambert , Pete Walsh , Pradeep Dasigi , Robert Berry , Saumya Malik , Saurabh Shah , Scott Geng , Shane Arora , Shashank Gupta , Taira Anderson , Teng Xiao , Tyler Murray , Tyler Romero , Victoria Graf , Akari Asai , Akshita Bhagia , Alexander Wettig , Alisa Liu , Aman Rangapur , Chloe Anastasiades , Costa Huang , Dustin Schwenk , Harsh Trivedi , Ian Magnusson , Jaron Lochner , Jiacheng Liu , Lester James V. Miranda , Maarten Sap , Malia Morgan , Michael Schmitz , Michal Guerquin , Michael Wilson , Regan Huff , Ronan Le Bras , Rui Xin , Rulin Shao , Sam Skjonsberg , Shannon Zejiang Shen , Shuyue Stella Li , Tucker Wilde , Valentina Pyatkin , Will Merrill , Yapei Chang , Yuling Gu , Zhiyuan Zeng , Ashish Sabharwal , Luke Zettlemoyer , Pang Wei Koh , Ali Farhadi , Noah A. Smith , Hannaneh Hajishirzi

Authors on Pith no claims yet

classification 💻 cs.CL cs.LG

keywords modelolmofamilyfully-openmodelsbuildcallingchat

0 comments

read the original abstract

We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tracing Persona Vectors Through LLM Pretraining
cs.CL 2026-05 unverdicted novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
Large Language Models Lack Temporal Awareness of Medical Knowledge
cs.LG 2026-05 unverdicted novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Pretraining Exposure Explains Popularity Judgments in Large Language Models
cs.CL 2026-05 unverdicted novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
cs.CR 2026-05 unverdicted novelty 7.0

MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
Scaling Laws for Mixture Pretraining Under Data Constraints
cs.LG 2026-05 conditional novelty 7.0

Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
cs.CL 2026-05 unverdicted novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
cs.LG 2026-05 unverdicted novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
cs.CL 2026-05 unverdicted novelty 7.0

LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
cs.CL 2026-05 unverdicted novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Implicit Representations of Grammaticality in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.
The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining
cs.CY 2026-05 unverdicted novelty 7.0

Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than ...
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
cs.CL 2026-04 unverdicted novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
Psychological Steering of Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
Synthesizing Instruction-Tuning Datasets with Contrastive Decoding
cs.CL 2026-04 conditional novelty 7.0

CoDIT creates instruction-tuning datasets via contrastive decoding to isolate instruction-following capabilities, yielding models that outperform those trained on standard generated or public datasets.
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
cs.AI 2026-04 unverdicted novelty 7.0

SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
MARS: Enabling Autoregressive Models Multi-Token Generation
cs.CL 2026-04 unverdicted novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
DeonticBench: A Benchmark for Reasoning over Rules
cs.CL 2026-04 unverdicted novelty 7.0

DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
Large Language Models Align with the Human Brain during Creative Thinking
q-bio.NC 2026-04 unverdicted novelty 7.0

LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
cs.LG 2026-05 unverdicted novelty 6.0

Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
cs.LG 2026-05 unverdicted novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
cs.LG 2026-05 unverdicted novelty 6.0

Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
cs.LG 2026-05 conditional novelty 6.0

Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG 2026-05 unverdicted novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
cs.AI 2026-05 unverdicted novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
Remember to Forget: Gated Adaptive Positional Encoding
cs.LG 2026-05 unverdicted novelty 6.0

GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
Post-training makes large language models less human-like
cs.CL 2026-05 unverdicted novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
cs.CL 2026-05 conditional novelty 6.0

Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Prescriptive Scaling Laws for Data Constrained Training
cs.LG 2026-05 unverdicted novelty 6.0

A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the ...
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

LLMs favor task-appropriate reasoning over conflicting instructions, yet reasoning types are linearly encoded in middle-to-late layers and can be steered to boost instruction compliance by up to 29%.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
cs.LG 2026-04 unverdicted novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...
Estimating Tail Risks in Language Model Output Distributions
cs.LG 2026-04 unverdicted novelty 6.0

Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
TEMPO: Scaling Test-time Training for Large Reasoning Models
cs.LG 2026-04 unverdicted novelty 6.0

TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
cs.AI 2026-04 conditional novelty 6.0

Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
cs.LG 2026-04 unverdicted novelty 6.0

RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
cs.CL 2026-04 unverdicted novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
cs.LG 2026-04 unverdicted novelty 6.0

RLVR-trained LLMs exploit verifier weaknesses by producing non-generalizable outputs on rule-induction tasks, detectable via Isomorphic Perturbation Testing.
MEMENTO: Teaching LLMs to Manage Their Own Context
cs.AI 2026-04 unverdicted novelty 6.0

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
cs.LG 2026-04 unverdicted novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
Seed Bank, Co-op, Stoop Swap: Metaphors for Governing Language Model Data for Creative Writing
cs.HC 2026-05 unverdicted novelty 5.0

Workshops with over 100 creative writers produced metaphors and four themes for language model governance that favor consent-driven, smaller open models encoding community values.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
Multilinguality at the Edge: Developing Language Models for the Global South
cs.CL 2026-04 unverdicted novelty 5.0

A survey of 232 papers on the intersection of multilingual language modeling and edge deployment identifies the 'last mile' challenge for Global South communities and offers recommendations for more inclusive NLP.
Hyperloop Transformers
cs.LG 2026-04 unverdicted novelty 5.0

Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
cs.CL 2026-04 conditional novelty 5.0

Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performa...
Your Model Diversity, Not Method, Determines Reasoning Strategy
cs.AI 2026-04 unverdicted novelty 5.0

The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning
cs.LG 2026-04 unverdicted novelty 5.0

Apriel-1.5-OpenReasoner uses RL post-training with adaptive sampling and difficulty-aware penalties to boost reasoning accuracy on AIME, GPQA, MMLU-Pro and LiveCodeBench while producing shorter traces and generalizing...
Reflections and New Directions for Human-Centered Large Language Models
cs.CL 2026-05 unverdicted novelty 4.0

Model developers must address human concerns, preferences, values, and goals with rigor at every stage of the LLM pipeline rather than only in post-training.