pith. machine review for the scientific record.

arxiv: 2502.02737 · v1 · submitted 2025-02-04 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language models · data-centric training · dataset mixing · specialized datasets · instruction following · model performance

The pith

SmolLM2 shows a 1.7 billion parameter model can surpass other small language models by training on eleven trillion tokens of carefully mixed data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper documents the development of SmolLM2, a 1.7 billion parameter language model trained on roughly eleven trillion tokens. It uses a multi-stage process that combines general web text with specialized math, code, and instruction-following data. New datasets are created to fill gaps in existing resources for mathematics, code education, and conversations. Dataset mixing rates are adjusted at each stage based on observed performance. This data-centric approach produces a model that outperforms recent small models such as Qwen2.5-1.5B and Llama3.2-1B.

Core claim

SmolLM2, a 1.7B parameter model, reaches higher performance than comparable small language models by overtraining on approximately 11 trillion tokens through a multi-stage regimen that mixes web text with math, code, and instruction data, using newly prepared datasets FineMath, Stack-Edu, and SmolTalk where prior collections proved insufficient.
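
For scale, a minimal arithmetic sketch using only the abstract's round numbers; the 20 tokens-per-parameter yardstick is the common Chinchilla compute-optimal heuristic, applied here as outside framing rather than anything the paper states:

```python
# How far past compute-optimal the ~11T-token run goes, using the
# abstract's round numbers. The 20 tokens/parameter yardstick is the
# usual Chinchilla heuristic, not a figure from the paper.
params = 1.7e9    # SmolLM2 parameter count
tokens = 11e12    # approximate training tokens

tokens_per_param = tokens / params            # ~6,500
overtrain_factor = tokens_per_param / 20      # ~320x the heuristic

print(f"{tokens_per_param:,.0f} tokens per parameter "
      f"(~{overtrain_factor:,.0f}x a compute-optimal budget)")
```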

What carries the argument

The multi-stage data mixing process that iteratively updates dataset proportions according to performance at the prior stage, together with the introduction of specialized datasets to address quality and quantity shortfalls.
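
The paper describes this refinement only at the level of procedure; below is a hypothetical sketch of the loop's shape, where the domain names, targets, update rule, and the evaluate() stub are all illustrative placeholders (the paper reports manual updates, not this or any automatic rule):

```python
import random

# Hypothetical stage-wise mixture refinement: after each stage,
# domains that lag their benchmark targets get upweighted.
# Everything below is illustrative, not the paper's procedure.

def evaluate(mixture):
    """Stand-in for training a checkpoint on `mixture` and scoring
    per-domain benchmarks; here it fakes scores that rise with weight."""
    return {d: min(1.0, 0.3 + 0.8 * w + random.uniform(0.0, 0.05))
            for d, w in mixture.items()}

def refine(mixture, scores, targets, step=0.05):
    # Upweight each domain in proportion to its benchmark shortfall,
    # then renormalize so the proportions sum to 1.
    raw = {d: w + step * max(0.0, targets[d] - scores[d])
           for d, w in mixture.items()}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

mixture = {"web": 0.85, "math": 0.05, "code": 0.05, "instruct": 0.05}
targets = {"web": 0.60, "math": 0.55, "code": 0.50, "instruct": 0.65}

for stage in range(4):                 # one update per training stage
    scores = evaluate(mixture)
    mixture = refine(mixture, scores, targets)
    print(f"stage {stage}: " +
          ", ".join(f"{d}={w:.2f}" for d, w in mixture.items()))
```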

If this is right

  • Small language models can become competitive for deployment in resource-constrained settings.
  • Releasing both the model and the prepared datasets enables further community experiments on efficient training.
  • Data curation and iterative mixing can function as the main lever for capability gains without increasing model size.
  • Later small models may follow similar stage-wise refinement of data mixtures to close gaps with larger peers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Heavy reliance on data volume and quality could reduce the total compute needed to reach given performance levels.
  • The same iterative mixing strategy might transfer to training efficient models in non-language domains.
  • Optimal data proportions may differ systematically with model size, inviting targeted experiments on that relationship.

Load-bearing premise

Performance gains come chiefly from the multi-stage data mixing and new datasets rather than from differences in training compute, hyperparameters, or evaluation setup.

What would settle it

Training a model of the same size on the same total tokens but with standard datasets and fixed mixing rates, then finding that it matches SmolLM2 on the same benchmarks, would falsify the claim; if that control shows no such gains over the baselines, the data-centric attribution stands.

read the original abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents SmolLM2, a 1.7B-parameter language model trained on approximately 11 trillion tokens using a multi-stage process with dynamic data mixing of web text, math, code, and instruction data. The authors introduce three new datasets (FineMath, Stack-Edu, SmolTalk) to address perceived gaps in existing corpora and use small-scale ablations plus manual mixing-rate updates based on prior-stage performance to guide training. The central claim is that SmolLM2 outperforms recent comparators such as Qwen2.5-1.5B and Llama3.2-1B on standard benchmarks.

Significance. If the performance gains are shown to be driven by the data-centric choices rather than raw token volume or unstated hyperparameter differences, the work would provide a concrete, reproducible recipe for high-quality small-model training. The release of the model weights and the three new datasets is a clear strength, enabling downstream research on data curation and scaling laws for the sub-2B regime.

major comments (3)
  1. [§4 Training, §5 Experiments] The paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.
  2. [Table 2 / Figure 3] The reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.
  3. [§3.2 Ablations] The small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.
minor comments (3)
  1. [Abstract] The abstract states 'outperforms' without quoting any numbers; the headline figures should be moved into the abstract or the first paragraph of the introduction for immediate visibility.
  2. [§4] Notation for mixing rates and stage transitions is introduced informally; a single consolidated table listing per-stage token counts, mixing ratios, and learning-rate schedules would improve reproducibility.
  3. [Related Work] Missing reference to recent data-centric scaling papers (e.g., DataComp, Dolma) that use similar multi-stage mixing; situating the manual refinement process against those automated baselines would strengthen the contribution.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

Thank you for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions.

read point-by-point responses
  1. Referee: [§4 Training, §5 Experiments] The paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.

    Authors: We agree that a full compute-matched baseline would strengthen causal attribution. However, training a second 1.7B model on 11T tokens exceeds our available compute budget. We will revise §§4–5 to explicitly acknowledge this limitation, expand the existing small-scale fixed-compute ablations that isolate dataset effects, and release training scripts so the community can run such controls. revision: partial

  2. Referee: [Table 2 / Figure 3] The reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.

    Authors: We will update Table 2 with the best publicly reported figures: Qwen2.5-1.5B on ~18T tokens and Llama3.2-1B on ~9T tokens (per Meta announcements). We will add a FLOPs column using the standard 6ND approximation (a numeric sketch appears below, after the standing objections) and a footnote discussing comparison caveats when exact proprietary counts are unavailable. revision: yes

  3. Referee: [§3.2 Ablations] The small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.

    Authors: We will revise §3.2 to report means and standard deviations over three random seeds. We will also add a new fixed-token-count control that substitutes equivalent volumes of general web data for FineMath/Stack-Edu/SmolTalk, thereby isolating data-quality effects from training duration. revision: yes

standing simulated objections (1 unresolved)
  • A full-scale 11T-token compute-matched baseline using only standard web data, which would require prohibitive additional compute.
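
The 6ND approximation referenced in response 2 makes the comparator question concrete. A minimal sketch, assuming the round figures quoted in the rebuttal; the comparator token counts are public reports, not numbers from this paper:

```python
# Training FLOPs via the standard 6*N*D approximation, where N is
# parameter count and D is training tokens. Token counts are the
# rounded public figures quoted in the rebuttal, not measurements.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

models = {
    "SmolLM2-1.7B": (1.7e9, 11e12),
    "Qwen2.5-1.5B": (1.5e9, 18e12),  # ~18T tokens per the Qwen2.5 report
    "Llama3.2-1B":  (1.0e9,  9e12),  # ~1B params, ~9T tokens per Meta
}
for name, (n, d) in models.items():
    print(f"{name}: {train_flops(n, d):.2e} training FLOPs")
# Under these figures SmolLM2 lands near 1.1e23 FLOPs, between
# Llama3.2-1B (~5.4e22) and Qwen2.5-1.5B (~1.6e23).
```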

Circularity Check

0 steps flagged

No significant circularity in empirical training and evaluation

full rationale

The paper describes an empirical process of multi-stage training on ~11T tokens with data mixing and new datasets (FineMath, Stack-Edu, SmolTalk), evaluated via measured performance on external benchmarks. No equations, fitted parameters, or derivations are presented that would make the outperformance claim equivalent to the training inputs by construction. Small-scale ablations and manual mixing-rate updates based on prior-stage results are iterative design steps, not self-referential predictions. The central claim rests on direct comparisons to other models rather than any self-citation chain or uniqueness theorem that reduces to the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that next-token prediction on the described data mixture produces the reported benchmark improvements; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • data mixing rates
    Updated manually after each stage based on prior performance; exact values not stated in abstract.
axioms (1)
  • Next-token prediction is an effective objective for language modeling
    Implicit foundation of all autoregressive LM training described.

pith-pipeline@v0.9.0 · 5642 in / 1232 out tokens · 35317 ms · 2026-05-13T17:25:11.114099+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Paper passage linked to the cited Recognition theorem:

    we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk)

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Paper passage linked to the cited Recognition theorem:

    we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Model Spec Midtraining: Improving How Alignment Training Generalizes

    cs.AI 2026-05 unverdicted novelty 8.0

    Model spec midtraining trains AI models on documents about their alignment rules before demonstration fine-tuning, producing stronger and more controllable generalization to the intended values and safety behaviors.

  2. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  3. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  4. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  5. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  6. Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

    cs.CR 2026-04 conditional novelty 7.0

    Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

  7. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  8. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  9. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  10. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  11. Efficient Pre-Training with Token Superposition

    cs.CL 2026-05 unverdicted novelty 6.0

    Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.

  12. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  13. 6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence

    cs.NI 2026-05 unverdicted novelty 6.0

    6G networks need LLM-based agents in a layered semantic control plane to achieve autonomous intelligence, with empirical results showing that heterogeneous deployment across device-edge-core is required due to inheren...

  14. AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  15. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  16. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  17. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

  18. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  19. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  20. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  21. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  22. Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

    cs.CV 2026-05 unverdicted novelty 5.0

    SkillFormer, PATS, and ProfVLM deliver state-of-the-art multi-view proficiency estimation on Ego-Exo4D with up to 20x fewer parameters by combining selective fusion, dense sampling, and generative feedback.

  23. Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

    cs.CL 2026-04 conditional novelty 5.0

    Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performa...

  24. TinyMU: A Compact Audio-Language Model for Music Understanding

    cs.SD 2026-04 unverdicted novelty 5.0

    TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 24 Pith papers · 38 internal anchors
