Mixed citation behavior. Most common role is background (56%).
abstract
We introduce Olmo 3, a family of state-of-the-art, fully-open language models at the 7B and 32B parameter scales. Olmo 3 model construction targets long-context reasoning, function calling, coding, instruction following, general chat, and knowledge recall. This release includes the entire model flow, i.e., the full lifecycle of the family of models, including every stage, checkpoint, data point, and dependency used to build it. Our flagship model, Olmo 3 Think 32B, is the strongest fully-open thinking model released to-date.
citing papers explorer
-
BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks
BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.
-
Tracing Persona Vectors Through LLM Pretraining
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
-
Large Language Models Lack Temporal Awareness of Medical Knowledge
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
MultiHashFormer: Hash-based Generative Language Models
MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
-
Learnability-Informed Fine-Tuning of Diffusion Language Models
LIFT is a learnability-informed SFT algorithm for diffusion LMs that aligns token difficulty with diffusion time steps, yielding up to 3x gains on AIME'24 and AIME'25 over standard SFT baselines.
-
MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
-
Learning from Language Feedback via Variational Policy Distillation
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
LaTER reduces LLM token usage 16-33% on reasoning benchmarks by exploring in latent space then switching to explicit CoT verification, with gains like 70% to 73.3% on AIME 2025 in the training-free version.
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
Implicit Representations of Grammaticality in Language Models
Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.
-
The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining
Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.
-
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
-
Synthesizing Instruction-Tuning Datasets with Contrastive Decoding
CoDIT creates instruction-tuning datasets via contrastive decoding to isolate instruction-following capabilities, yielding models that outperform those trained on standard generated or public datasets.
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
MARS: Enabling Autoregressive Models Multi-Token Generation
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
-
DeonticBench: A Benchmark for Reasoning over Rules
DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
-
Large Language Models Align with the Human Brain during Creative Thinking
LLMs show scaling and training-dependent alignment with human brain responses in creativity-related networks during divergent thinking tasks, measured via RSA on fMRI data.
-
LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
LLM-ODE integrates large language models into genetic programming to guide symbolic search for governing equations of dynamical systems, outperforming classical GP on 91 test cases in efficiency and solution quality.
-
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
-
CREATE: Testing LLMs for Associative Creativity
CREATE is a benchmark that scores LLMs on their ability to produce many specific and diverse associative paths between concepts drawn from parametric knowledge.
-
Enhancing LLM Metacognition via Cognitive Pairwise Training
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
-
Consolidating Rewarded Perturbations for LLM Post-Training
CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.
-
Boundary-targeted Membership Inference Attacks on Safety Classifiers
A boundary-targeted MIA strategy recovers 19% of distress-flagged conversations from a safety classifier at 5% false-positive rate, 3.5 times better than prior methods.
-
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
-
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
-
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
-
HRM-Text: Efficient Pretraining Beyond Scaling
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.
-
LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
LP-Eval is a new expert-co-designed rubric and annotated dataset showing that LLMs mostly produce well-formed legal propositions from EU court decisions, with higher expert-rated quality for established cases and improved LLM-as-judge alignment when using the rubric.
-
The Evaluation Game: Beyond Static LLM Benchmarking
Presents a game-theoretic model with group actions for data augmentation in LLM adversarial evaluation, demonstrating local generalization from fine-tuning on three model families and redefining benchmarks as orbits under group actions.
-
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
-
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
-
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
-
Scaling Laws for Mixture Pretraining Under Data Constraints
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
-
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
-
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
-
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
-
Remember to Forget: Gated Adaptive Positional Encoding
GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.
-
Post-training makes large language models less human-like
Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.
-
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Prescriptive Scaling Laws for Data Constrained Training
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
-
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
-
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.