pith. machine review for the scientific record.

arxiv: 2502.02737 · v1 · submitted 2025-02-04 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:25 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language models · data-centric training · dataset mixing · specialized datasets · instruction following · model performance

The pith

SmolLM2 shows a 1.7 billion parameter model can surpass other small language models by training on eleven trillion tokens of carefully mixed data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper documents the development of SmolLM2, a 1.7 billion parameter language model trained on roughly eleven trillion tokens. It uses a multi-stage process that combines general web text with specialized math, code, and instruction-following data. New datasets are created to fill gaps in existing resources for mathematics, code education, and conversations. Dataset mixing rates are adjusted at each stage based on observed performance. This data-centric approach produces a model that outperforms recent small models such as Qwen2.5-1.5B and Llama3.2-1B.

Core claim

SmolLM2, a 1.7B parameter model, reaches higher performance than comparable small language models by overtraining on approximately 11 trillion tokens through a multi-stage regimen that mixes web text with math, code, and instruction data, using newly prepared datasets FineMath, Stack-Edu, and SmolTalk where prior collections proved insufficient.
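
For scale, a minimal arithmetic sketch using only the abstract's round numbers; the 20 tokens-per-parameter yardstick is the common Chinchilla compute-optimal heuristic, applied here as outside framing rather than anything the paper states:

```python
# How far past compute-optimal the ~11T-token run goes, using the
# abstract's round numbers. The 20 tokens/parameter yardstick is the
# usual Chinchilla heuristic, not a figure from the paper.
params = 1.7e9    # SmolLM2 parameter count
tokens = 11e12    # approximate training tokens

tokens_per_param = tokens / params            # ~6,500
overtrain_factor = tokens_per_param / 20      # ~320x the heuristic

print(f"{tokens_per_param:,.0f} tokens per parameter "
      f"(~{overtrain_factor:,.0f}x a compute-optimal budget)")
```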

What carries the argument

The multi-stage data mixing process that iteratively updates dataset proportions according to performance at the prior stage, together with the introduction of specialized datasets to address quality and quantity shortfalls.
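
The paper describes this refinement only at the level of procedure; below is a hypothetical sketch of the loop's shape, where the domain names, targets, update rule, and the evaluate() stub are all illustrative placeholders (the paper reports manual updates, not this or any automatic rule):

```python
import random

# Hypothetical stage-wise mixture refinement: after each stage,
# domains that lag their benchmark targets get upweighted.
# Everything below is illustrative, not the paper's procedure.

def evaluate(mixture):
    """Stand-in for training a checkpoint on `mixture` and scoring
    per-domain benchmarks; here it fakes scores that rise with weight."""
    return {d: min(1.0, 0.3 + 0.8 * w + random.uniform(0.0, 0.05))
            for d, w in mixture.items()}

def refine(mixture, scores, targets, step=0.05):
    # Upweight each domain in proportion to its benchmark shortfall,
    # then renormalize so the proportions sum to 1.
    raw = {d: w + step * max(0.0, targets[d] - scores[d])
           for d, w in mixture.items()}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}

mixture = {"web": 0.85, "math": 0.05, "code": 0.05, "instruct": 0.05}
targets = {"web": 0.60, "math": 0.55, "code": 0.50, "instruct": 0.65}

for stage in range(4):                 # one update per training stage
    scores = evaluate(mixture)
    mixture = refine(mixture, scores, targets)
    print(f"stage {stage}: " +
          ", ".join(f"{d}={w:.2f}" for d, w in mixture.items()))
```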

If this is right

  • Small language models can become competitive for deployment in resource-constrained settings.
  • Releasing both the model and the prepared datasets enables further community experiments on efficient training.
  • Data curation and iterative mixing can function as the main lever for capability gains without increasing model size.
  • Later small models may follow similar stage-wise refinement of data mixtures to close gaps with larger peers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Heavy reliance on data volume and quality could reduce the total compute needed to reach given performance levels.
  • The same iterative mixing strategy might transfer to training efficient models in non-language domains.
  • Optimal data proportions may differ systematically with model size, inviting targeted experiments on that relationship.

Load-bearing premise

Performance gains come chiefly from the multi-stage data mixing and new datasets rather than from differences in training compute, hyperparameters, or evaluation setup.

What would settle it

Training a model of the same size on the same total tokens but with standard datasets and fixed mixing rates, then finding that it matches SmolLM2 on the same benchmarks, would falsify the claim; if that control shows no such gains over the baselines, the data-centric attribution stands.

read the original abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents SmolLM2, a 1.7B-parameter language model trained on approximately 11 trillion tokens using a multi-stage process with dynamic data mixing of web text, math, code, and instruction data. The authors introduce three new datasets (FineMath, Stack-Edu, SmolTalk) to address perceived gaps in existing corpora and use small-scale ablations plus manual mixing-rate updates based on prior-stage performance to guide training. The central claim is that SmolLM2 outperforms recent comparators such as Qwen2.5-1.5B and Llama3.2-1B on standard benchmarks.

Significance. If the performance gains are shown to be driven by the data-centric choices rather than raw token volume or unstated hyperparameter differences, the work would provide a concrete, reproducible recipe for high-quality small-model training. The release of the model weights and the three new datasets is a clear strength, enabling downstream research on data curation and scaling laws for the sub-2B regime.

major comments (3)
  1. [§4 Training, §5 Experiments] The paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.
  2. [Table 2 / Figure 3] The reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.
  3. [§3.2 Ablations] The small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.
minor comments (3)
  1. [Abstract] The abstract states 'outperforms' without quoting any numbers; the headline figures should be moved into the abstract or the first paragraph of the introduction for immediate visibility.
  2. [§4] Notation for mixing rates and stage transitions is introduced informally; a single consolidated table listing per-stage token counts, mixing ratios, and learning-rate schedules would improve reproducibility.
  3. [Related Work] Missing reference to recent data-centric scaling papers (e.g., DataComp, Dolma) that use similar multi-stage mixing; situating the manual refinement process against those automated baselines would strengthen the contribution.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

Thank you for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions.

read point-by-point responses
  1. Referee: [§4 Training, §5 Experiments] The paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.

    Authors: We agree that a full compute-matched baseline would strengthen causal attribution. However, training a second 1.7B model on 11T tokens exceeds our available compute budget. We will revise §§4–5 to explicitly acknowledge this limitation, expand the existing small-scale fixed-compute ablations that isolate dataset effects, and release training scripts so the community can run such controls. revision: partial

  2. Referee: [Table 2 / Figure 3] The reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.

    Authors: We will update Table 2 with the best publicly reported figures: Qwen2.5-1.5B on ~18T tokens and Llama3.2-1B on ~9T tokens (per Meta announcements). We will add a FLOPs column using the standard 6ND approximation (a numeric sketch appears below, after the standing objections) and a footnote discussing comparison caveats when exact proprietary counts are unavailable. revision: yes

  3. Referee: [§3.2 Ablations] The small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.

    Authors: We will revise §3.2 to report means and standard deviations over three random seeds. We will also add a new fixed-token-count control that substitutes equivalent volumes of general web data for FineMath/Stack-Edu/SmolTalk, thereby isolating data-quality effects from training duration. revision: yes

standing simulated objections (1 unresolved)
  • A full-scale 11T-token compute-matched baseline using only standard web data, which would require prohibitive additional compute.
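
The 6ND approximation referenced in response 2 makes the comparator question concrete. A minimal sketch, assuming the round figures quoted in the rebuttal; the comparator token counts are public reports, not numbers from this paper:

```python
# Training FLOPs via the standard 6*N*D approximation, where N is
# parameter count and D is training tokens. Token counts are the
# rounded public figures quoted in the rebuttal, not measurements.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

models = {
    "SmolLM2-1.7B": (1.7e9, 11e12),
    "Qwen2.5-1.5B": (1.5e9, 18e12),  # ~18T tokens per the Qwen2.5 report
    "Llama3.2-1B":  (1.0e9,  9e12),  # ~1B params, ~9T tokens per Meta
}
for name, (n, d) in models.items():
    print(f"{name}: {train_flops(n, d):.2e} training FLOPs")
# Under these figures SmolLM2 lands near 1.1e23 FLOPs, between
# Llama3.2-1B (~5.4e22) and Qwen2.5-1.5B (~1.6e23).
```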

Circularity Check

0 steps flagged

No significant circularity in empirical training and evaluation

full rationale

The paper describes an empirical process of multi-stage training on ~11T tokens with data mixing and new datasets (FineMath, Stack-Edu, SmolTalk), evaluated via measured performance on external benchmarks. No equations, fitted parameters, or derivations are presented that would make the outperformance claim equivalent to the training inputs by construction. Small-scale ablations and manual mixing-rate updates based on prior-stage results are iterative design steps, not self-referential predictions. The central claim rests on direct comparisons to other models rather than any self-citation chain or uniqueness theorem that reduces to the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that next-token prediction on the described data mixture produces the reported benchmark improvements; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • data mixing rates
    Updated manually after each stage based on prior performance; exact values not stated in abstract.
axioms (1)
  • Next-token prediction is an effective objective for language modeling
    Implicit foundation of all autoregressive LM training described.

pith-pipeline@v0.9.0 · 5642 in / 1232 out tokens · 35317 ms · 2026-05-13T17:25:11.114099+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Paper passage linked to the cited Recognition theorem:

    we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk)

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Paper passage linked to the cited Recognition theorem:

    we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Model Spec Midtraining: Improving How Alignment Training Generalizes

    cs.AI 2026-05 unverdicted novelty 8.0

    Model spec midtraining trains AI models on documents about their alignment rules before demonstration fine-tuning, producing stronger and more controllable generalization to the intended values and safety behaviors.

  2. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  3. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  4. When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

    cs.PF 2026-05 unverdicted novelty 7.0

    A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.

  5. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  6. Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

    cs.CR 2026-04 conditional novelty 7.0

    Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.

  7. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  8. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  9. Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

    cs.LG 2026-05 conditional novelty 6.0

    Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.

  10. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  11. Efficient Pre-Training with Token Superposition

    cs.CL 2026-05 unverdicted novelty 6.0

    Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.

  12. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  13. 6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence

    cs.NI 2026-05 unverdicted novelty 6.0

    6G networks need LLM-based agents in a layered semantic control plane to achieve autonomous intelligence, with empirical results showing that heterogeneous deployment across device-edge-core is required due to inheren...

  14. AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  15. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  16. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  17. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

  18. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  19. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  20. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  21. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  22. Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

    cs.CV 2026-05 unverdicted novelty 5.0

    SkillFormer, PATS, and ProfVLM deliver state-of-the-art multi-view proficiency estimation on Ego-Exo4D with up to 20x fewer parameters by combining selective fusion, dense sampling, and generative feedback.

  23. Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?

    cs.CL 2026-04 conditional novelty 5.0

    Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performa...

  24. TinyMU: A Compact Audio-Language Model for Music Understanding

    cs.SD 2026-04 unverdicted novelty 5.0

    TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 24 Pith papers · 38 internal anchors
