SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Pith reviewed 2026-05-13 17:25 UTC · model grok-4.3
The pith
SmolLM2 shows a 1.7 billion parameter model can surpass other small language models by training on eleven trillion tokens of carefully mixed data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmolLM2, a 1.7B parameter model, reaches higher performance than comparable small language models by overtraining on approximately 11 trillion tokens through a multi-stage regimen that mixes web text with math, code, and instruction data, using newly prepared datasets FineMath, Stack-Edu, and SmolTalk where prior collections proved insufficient.
What carries the argument
The multi-stage data mixing process that iteratively updates dataset proportions according to performance at the prior stage, together with the introduction of specialized datasets to address quality and quantity shortfalls.
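The stage-wise refinement described above can be illustrated with a small sketch. The paper describes a manual process, so the update rule below is a hypothetical stand-in, not the authors' method: the function name `update_mixture`, the `step` parameter, the domain names, and all scores and targets are assumptions made up for illustration.

```python
def update_mixture(weights, scores, targets, step=0.05):
    """Shift mixing weight toward domains that fell short at the previous stage.

    Hypothetical heuristic: the paper adjusts mixing rates by hand, not by this formula.
    """
    # Shortfall per domain: how far the previous stage landed below its target.
    gaps = {d: max(targets[d] - scores[d], 0.0) for d in weights}
    total_gap = sum(gaps.values())
    if total_gap == 0:
        return dict(weights)  # nothing is lagging, keep the mixture unchanged
    # Give each domain a share of `step` proportional to its shortfall, then renormalize.
    raw = {d: weights[d] + step * gaps[d] / total_gap for d in weights}
    norm = sum(raw.values())
    return {d: w / norm for d, w in raw.items()}


# Made-up mixture, prior-stage scores, and targets (not values from the paper).
stage1 = {"web": 0.90, "code": 0.10, "math": 0.00}
scores = {"web": 0.60, "code": 0.20, "math": 0.05}
targets = {"web": 0.60, "code": 0.30, "math": 0.25}

stage2 = update_mixture(stage1, scores, targets)
for domain, weight in stage2.items():
    print(f"{domain}: {weight:.3f}")
```

In this toy setting the math and code shares rise and the web share falls, because only math and code missed their targets. A manual process can weigh factors that a fixed rule like this ignores, such as how much high-quality data remains in each domain.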
If this is right
- Small language models can become competitive for deployment in resource-constrained settings.
- Releasing both the model and the prepared datasets enables further community experiments on efficient training.
- Data curation and iterative mixing can function as the main lever for capability gains without increasing model size.
- Later small models may follow similar stage-wise refinement of data mixtures to close gaps with larger peers.
Where Pith is reading between the lines
- Heavy reliance on data volume and quality could reduce the total compute needed to reach given performance levels.
- The same iterative mixing strategy might transfer to training efficient models in non-language domains.
- Optimal data proportions may differ systematically with model size, inviting targeted experiments on that relationship.
Load-bearing premise
Performance gains come chiefly from the multi-stage data mixing and new datasets rather than from differences in training compute, hyperparameters, or evaluation setup.
What would settle it
Training a model of the same size on the same total tokens but with standard datasets and fixed mixing rates, then measuring no improvement over the baselines on the same benchmarks, would falsify the claim.
Original abstract
While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SmolLM2, a 1.7B-parameter language model trained on approximately 11 trillion tokens using a multi-stage process with dynamic data mixing of web text, math, code, and instruction data. The authors introduce three new datasets (FineMath, Stack-Edu, SmolTalk) to address perceived gaps in existing corpora and use small-scale ablations plus manual mixing-rate updates based on prior-stage performance to guide training. The central claim is that SmolLM2 outperforms recent comparators such as Qwen2.5-1.5B and Llama3.2-1B on standard benchmarks.
Significance. If the performance gains are shown to be driven by the data-centric choices rather than raw token volume or unstated hyperparameter differences, the work would provide a concrete, reproducible recipe for high-quality small-model training. The release of the model weights and the three new datasets is a clear strength, enabling downstream research on data curation and scaling laws for the sub-2B regime.
major comments (3)
- [§4, §5] §4 (Training) and §5 (Experiments): the paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.
- [Table 2] Table 2 / Figure 3: the reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.
- [§3.2] §3.2 (Ablations): the small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.
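The seed-variance request in the last comment amounts to reporting a mean and a sample standard deviation for each ablation configuration. A minimal sketch, assuming per-seed benchmark scores are already collected; the configuration names and all numbers are placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize(scores_by_config):
    """Return {config: (mean, sample standard deviation)} over per-seed scores."""
    return {name: (mean(vals), stdev(vals)) for name, vals in scores_by_config.items()}


# Placeholder per-seed scores for two hypothetical ablation arms.
runs = {
    "with_new_datasets": [0.42, 0.44, 0.43],
    "web_only_control": [0.40, 0.41, 0.39],
}

summary = summarize(runs)
for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} +/- {s:.3f}")
```

`statistics.stdev` computes the sample (n-1) standard deviation and needs at least two seeds per configuration. With only a handful of seeds the estimate is itself noisy, so a gap between arms is more convincing when it clearly exceeds the reported spread.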
minor comments (3)
- [Abstract] The abstract states 'outperforms' without quoting any numbers; the main text should move the headline numbers into the abstract or first paragraph of the introduction for immediate visibility.
- [§4] Notation for mixing rates and stage transitions is introduced informally; a single consolidated table listing per-stage token counts, mixing ratios, and learning-rate schedules would improve reproducibility.
- [Related Work] Missing reference to recent data-centric scaling papers (e.g., DataComp, Dolma) that use similar multi-stage mixing; situating the manual refinement process against those automated baselines would strengthen the contribution.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions.
Point-by-point responses
- Referee: [§4, §5] §4 (Training) and §5 (Experiments): the paper must include a compute-matched baseline that uses the same total token count (~11T) and architecture but replaces the multi-stage mixing and new datasets with a standard web-only or fixed-ratio mix. Without this, the attribution of gains to FineMath/Stack-Edu/SmolTalk and the dynamic mixing schedule cannot be isolated from the effect of overtraining scale.
  Authors: We agree that a full compute-matched baseline would strengthen causal attribution. However, training a second 1.7B model on 11T tokens exceeds our available compute budget. We will revise §§4–5 to explicitly acknowledge this limitation, expand the existing small-scale fixed-compute ablations that isolate dataset effects, and release training scripts so the community can run such controls.
  Revision: partial
- Referee: [Table 2] Table 2 / Figure 3: the reported benchmark scores for SmolLM2 versus Qwen2.5-1.5B and Llama3.2-1B must be accompanied by the exact training-token counts and FLOPs for each comparator. If the baselines were trained on substantially fewer tokens, the performance gap cannot be credited to data curation alone.
  Authors: We will update Table 2 with the best publicly reported figures: Qwen2.5-1.5B on ~18T tokens and Llama3.2-1B on ~9T tokens (per Meta announcements). We will add a FLOPs column using the standard 6ND approximation and a footnote discussing comparison caveats when exact proprietary counts are unavailable.
  Revision: yes
- Referee: [§3.2] §3.2 (Ablations): the small-scale ablation experiments need to report variance across random seeds and include a control that keeps total compute fixed while varying only the presence/absence of the new datasets. Current ablations appear to conflate data quality with training duration.
  Authors: We will revise §3.2 to report means and standard deviations over three random seeds. We will also add a new fixed-token-count control that substitutes equivalent volumes of general web data for FineMath/Stack-Edu/SmolTalk, thereby isolating data-quality effects from training duration.
  Revision: yes
- A full-scale 11T-token compute-matched baseline using only standard web data, which would require prohibitive additional compute.
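The 6ND approximation mentioned in the rebuttal estimates training compute as 6 × N × D, with N the parameter count and D the number of training tokens. A rough sketch of the proposed FLOPs column follows. The parameter counts are read off the model names, which is an assumption (exact counts differ somewhat), and the token counts are the approximate figures quoted in the rebuttal.

```python
def train_flops(params, tokens):
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * params * tokens


# (parameters, training tokens). Parameter counts are nominal values taken
# from the model names (assumption); token counts are the approximate figures
# quoted in the rebuttal.
models = {
    "SmolLM2-1.7B": (1.7e9, 11e12),
    "Qwen2.5-1.5B": (1.5e9, 18e12),
    "Llama3.2-1B": (1.0e9, 9e12),
}

flops = {name: train_flops(n, d) for name, (n, d) in models.items()}
for name, f in flops.items():
    print(f"{name}: {f:.2e} FLOPs")
```

Under these nominal inputs the three estimates differ by up to a factor of about three, which is why the referee asks for compute to be reported next to the benchmark scores. The rule of thumb ignores architecture details such as attention cost and embedding parameters, so it only supports order-of-magnitude comparisons.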
Circularity Check
No significant circularity in empirical training and evaluation
Full rationale
The paper describes an empirical process of multi-stage training on ~11T tokens with data mixing and new datasets (FineMath, Stack-Edu, SmolTalk), evaluated via measured performance on external benchmarks. No equations, fitted parameters, or derivations are presented that would make the outperformance claim equivalent to the training inputs by construction. Small-scale ablations and manual mixing-rate updates based on prior-stage results are iterative design steps, not self-referential predictions. The central claim rests on direct comparisons to other models rather than any self-citation chain or uniqueness theorem that reduces to the authors' prior work.
Axiom & Free-Parameter Ledger
free parameters (1)
- data mixing rates
axioms (1)
- [standard math] Next-token prediction is an effective objective for language modeling
Lean theorems connected to this paper
- Cost.FunctionalEquation, washburn_uniqueness_aczel (unclear): "we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk)"
- Foundation.HierarchyEmergence, hierarchy_emergence_forces_phi (unclear): "we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B"
Forward citations
Cited by 24 Pith papers
- Model Spec Midtraining: Improving How Alignment Training Generalizes
  Model spec midtraining trains AI models on documents about their alignment rules before demonstration fine-tuning, producing stronger and more controllable generalization to the intended values and safety behaviors.
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
  K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
- When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
  A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
  MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- Quantamination: Dynamic Quantization Leaks Your Data Across the Batch
  Dynamic quantization creates side channels allowing partial or full recovery of other users' batched data in at least four popular ML frameworks.
- A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
  A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
- Internalized Reasoning for Long-Context Visual Document Understanding
  A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.
- Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
  Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
- Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
  Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
- Efficient Pre-Training with Token Superposition
  Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.
- What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
  Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
- 6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence
  6G networks need LLM-based agents in a layered semantic control plane to achieve autonomous intelligence, with empirical results showing that heterogeneous deployment across device-edge-core is required due to inheren...
- AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
  Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
  BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
- EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
  A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...
- Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
  Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
- Dream 7B: Diffusion Large Language Models
  Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- SmolVLM: Redefining small and efficient multimodal models
  SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
  GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
- Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
  SkillFormer, PATS, and ProfVLM deliver state-of-the-art multi-view proficiency estimation on Ego-Exo4D with up to 20x fewer parameters by combining selective fusion, dense sampling, and generative feedback.
- Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
  Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performa...
- TinyMU: A Compact Audio-Language Model for Music Understanding
  TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.