hub Mixed citations

2 OLMo 2 Furious

Mixed citation behavior. Most common role is background (46%).

91 Pith papers citing it
Background 46% of classified citations
abstract

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.

citation-role summary

background 9 · method 3 · other 1

representative citing papers

Scaling limit of the Random Language Model

cond-mat.dis-nn · 2026-06-26 · unverdicted · novelty 8.0

In the scaling limit of the Random Language Model, a condensation transition occurs at x_c=1/8 with explicit scaling laws for rule usage and entropy derived from large-deviation principles and a mapping to Random Energy Models.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Spurious Rewards: Rethinking Training Signals in RLVR

cs.AI · 2025-06-12 · accept · novelty 8.0

Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.

Phase structure of the Random Language Model

cond-mat.dis-nn · 2026-06-26 · unverdicted · novelty 7.0

The Random Language Model exhibits a hierarchy of phase transitions in the double-scaling limit ε̃_d → 0, N → ∞ at fixed x = ε̃_d log N, with symbol correlations, non-uniform marginals, and glassy freezing, yielding scaling laws consistent with large language models.

Spectral Scaling Laws of Muon

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 7.0 · 3 refs

Local attention in fixed-precision transformers introduces a second past operator in linear temporal logic, strictly increasing expressivity over global attention alone, with hybrids being most expressive.

Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

cs.LG · 2026-04-26 · unverdicted · novelty 7.0

In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
