pith. machine review for the scientific record.

arxiv: 2401.02385 · v2 · submitted 2024-01-04 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

TinyLlama: An Open-Source Small Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:05 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: small language model · pretraining · Llama 2 · 1.1B parameters · open-source LLM · downstream tasks · FlashAttention

The pith

A 1.1 billion parameter model pretrained on one trillion tokens outperforms other open-source models of similar size on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinyLlama as a 1.1B parameter language model pretrained on roughly 1 trillion tokens for about three epochs. It adopts the Llama 2 architecture and tokenizer while adding open-source efficiency improvements such as FlashAttention. The central result is that this setup yields stronger results on standard downstream benchmarks than other open-source models in the same size range. A reader would care because the finding points to data volume and training choices as levers that can make smaller models competitive without needing larger parameter counts.
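A back-of-the-envelope check makes the data-volume lever concrete. The sketch below compares the reported token budget against the roughly 20-tokens-per-parameter compute-optimal heuristic from the scaling-law literature; the heuristic is an editorial assumption, not a figure from the paper.

```python
# Hedged arithmetic: how far TinyLlama's token budget exceeds a Chinchilla-style
# compute-optimal budget. The ~20 tokens/parameter rule of thumb is an assumption
# from the scaling-law literature, not a number reported in the paper.
params = 1.1e9    # parameters, as reported
tokens = 1.0e12   # pretraining tokens, as reported (~3 epochs)

heuristic_budget = 20 * params  # compute-optimal token count under the heuristic
print(f"heuristic budget: {heuristic_budget / 1e9:.0f}B tokens")
print(f"actual budget:    {tokens / 1e12:.1f}T tokens "
      f"(~{tokens / heuristic_budget:.0f}x the heuristic)")
```

On that heuristic, 1 trillion tokens is roughly 45 times the compute-optimal budget for a 1.1B model, which is the sense in which data volume rather than parameter count carries the result.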

Core claim

TinyLlama is a 1.1B parameter language model pretrained on approximately 1 trillion tokens using the Llama 2 architecture and tokenizer together with community optimizations including FlashAttention and Lit-GPT. The model achieves better performance across a series of downstream tasks than existing open-source language models of comparable size.

What carries the argument

TinyLlama, the 1.1B parameter model that combines the Llama 2 architecture with pretraining on 1 trillion tokens and open-source efficiency tools.

If this is right

  • Model size can be reduced while maintaining competitive task performance when pretraining data reaches trillions of tokens.
  • Open-source efficiency libraries enable practical training runs at the 1B scale.
  • Public release of checkpoints supports further fine-tuning and analysis by the community.
  • Downstream results improve measurably with increased pretraining tokens even when parameter count stays fixed at 1.1 billion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that data scaling may substitute for parameter scaling in some performance regimes.
  • Hardware-constrained settings could benefit from prioritizing data collection over model enlargement.
  • Similar recipes may transfer to other modalities where large corpora exist but compute budgets are limited.

Load-bearing premise

The performance edge comes from the scale of pretraining data and chosen optimizations rather than from evaluation setup or unintended overlap between training and test data.
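A minimal sketch of the kind of check that would probe this premise: flag benchmark test items whose long n-grams also appear in the pretraining text. The 13-gram window, the 0.5 overlap threshold, and the function names are illustrative assumptions, not tooling described in the paper.

```python
# Illustrative n-gram overlap check between pretraining text and benchmark items.
# Names, window size, and threshold are hypothetical, not from the paper.
def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(test_items, corpus_docs, n: int = 13, threshold: float = 0.5):
    corpus_grams: set[str] = set()
    for doc in corpus_docs:  # in practice this would stream over corpus shards
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for item in test_items:
        grams = ngrams(item, n)
        if not grams:
            continue
        overlap = len(grams & corpus_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((item, overlap))
    return flagged
```

Until a report of this kind exists, the premise that the edge comes from data scale rather than overlap remains an assumption.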

What would settle it

A head-to-head evaluation on the same downstream tasks in which a 1.1B model trained on substantially less data or without the listed optimizations matches or exceeds TinyLlama's scores.

original abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TinyLlama, a 1.1B-parameter language model pretrained on approximately 1 trillion tokens for roughly 3 epochs. It adopts the Llama 2 architecture and tokenizer, incorporates open-source optimizations such as FlashAttention and Lit-GPT for computational efficiency, and reports that the model significantly outperforms other open-source language models of comparable size on downstream tasks. Model checkpoints and training code are released publicly on GitHub.

Significance. If the reported performance margins are robust, the work is significant as a reproducible, openly available small-scale model that demonstrates the viability of achieving competitive downstream results through large-scale pretraining (1T tokens) combined with community-driven efficiency improvements. The public release of code and weights directly supports reproducibility and further research on efficient LLMs.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance (a minimal variance sketch follows these comments).
  2. [§3 (Training) and §4] The manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.
minor comments (2)
  1. [Results tables] In Table 1 or the equivalent results table, include the exact number of tokens seen per model and the precise evaluation harness (e.g., prompting format, few-shot settings) used for every baseline to enable direct replication.
  2. [§2 (Related Work)] Add a brief comparison of training FLOPs or wall-clock time against the strongest baselines to quantify the claimed computational-efficiency gains from FlashAttention and Lit-GPT.
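Major comment 1 asks for some notion of evaluation variance. A crude but cheap substitute, sketched below, attaches a binomial standard error to a single-run benchmark accuracy; the numbers are placeholders, not scores from the paper.

```python
import math

def accuracy_se(acc: float, n_items: int) -> float:
    """Standard error of an accuracy treated as a binomial proportion."""
    return math.sqrt(acc * (1.0 - acc) / n_items)

# Hypothetical score on a 10,000-item benchmark, not a result from the paper.
acc, n_items = 0.59, 10_000
half_width = 1.96 * accuracy_se(acc, n_items)
print(f"acc = {acc:.2f} ± {half_width:.3f} (95% confidence half-width)")
# Margins over a baseline comparable to this half-width are hard to distinguish
# from evaluation noise on a single run.
```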

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and have revised the manuscript accordingly where feasible.

point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance.

    Authors: We agree that error bars and statistical tests would strengthen the evaluation. However, pretraining a 1.1B model on 1T tokens required substantial compute, and we performed only a single training run. In the revised manuscript we have added an explicit statement in Section 4 noting that all reported results are from this single run and discussing the implications for variance. We also reference prior work on LLM evaluation stability. While we cannot rerun the full pretraining to obtain multiple seeds, the consistent gains across diverse benchmarks support the robustness of the findings. revision: partial

  2. Referee: [§3 (Training) and §4] The manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.

    Authors: We acknowledge the importance of this point. The original submission did not detail explicit decontamination checks. The pretraining corpus is SlimPajama, which applies its own deduplication and filtering. In the revised manuscript we have expanded Section 3 with additional description of the data sources and preprocessing pipeline, and added a paragraph in Section 4 explicitly noting the absence of benchmark-specific contamination analysis as a limitation. We believe the scale of training and the nature of the cleaned dataset make substantial leakage unlikely, but we agree this should be stated clearly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training results on external benchmarks

full rationale

The paper reports an empirical outcome: pretraining a 1.1B model on ~1T tokens using the Llama 2 architecture plus open-source optimizations (FlashAttention, Lit-GPT), then measuring downstream task performance against existing open-source baselines. No equations, derivations, fitted parameters, or predictions appear; the central claim is a direct train-and-evaluate result. The prior work it leans on is limited to the external Llama 2 paper and standard tools; none of it functions as a load-bearing uniqueness theorem or ansatz that would reduce the result to the authors' own prior inputs. The work therefore stands against external benchmarks, with no reduction by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer scaling assumptions and the effectiveness of Llama 2's architecture; no new entities are postulated.

free parameters (3)
  • model parameter count
    1.1B chosen as target size for the compact model
  • pretraining token count
    1 trillion tokens selected as training volume
  • training epochs
    Approximately 3 epochs over the data
axioms (2)
  • domain assumption The Llama 2 architecture and tokenizer are a suitable base for small-scale pretraining
    Paper builds directly on them without re-derivation
  • domain assumption Open-source efficiency tools (FlashAttention, Lit-GPT) preserve model quality while improving speed
    Invoked to justify computational choices (see the sketch below)
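The second axiom can be checked directly: a fused FlashAttention-style kernel computes the same softmax(QK^T / sqrt(d)) V as the naive formulation, so swapping it in should change speed and memory use, not outputs. The sketch below uses PyTorch's scaled_dot_product_attention as a stand-in for the fused kernel; the tensor shapes are arbitrary, and nothing here comes from the paper's training code.

```python
import torch
import torch.nn.functional as F

# Random tensors shaped (batch, heads, seq_len, head_dim); values are arbitrary.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

# Naive attention: softmax(Q K^T / sqrt(d)) V
naive = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

# Fused path; on supported GPUs this can dispatch to a FlashAttention-style kernel.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))  # True: same math, different kernel
```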

pith-pipeline@v0.9.0 · 5397 in / 1202 out tokens · 40023 ms · 2026-05-13T21:05:08.954699+00:00 · methodology

discussion (0)


Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  3. When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

    cs.CR 2026-05 conditional novelty 7.0

    Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.

  4. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  5. BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

    cs.CL 2026-04 unverdicted novelty 7.0

    BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods...

  6. SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

    cs.SD 2026-05 unverdicted novelty 6.0

    SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.

  7. PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.

  8. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  9. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  10. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

  11. Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

    cs.RO 2026-05 unverdicted novelty 6.0

    Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

  12. Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.

  13. StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.

  14. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  15. A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

    cs.RO 2026-04 unverdicted novelty 6.0

    A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...

  16. RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems

    cs.CR 2026-04 conditional novelty 6.0

    RAGShield detects all numerical manipulations in government RAG systems via pattern-based value extraction and cross-source verification, achieving 0% attack success rate on 430 real IRS-derived attacks where embeddin...

  17. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  18. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

    cs.LG 2024-01 unverdicted novelty 6.0

    EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.

  19. DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.

  20. TabEmb: Joint Semantic-Structure Embedding for Table Annotation

    cs.LG 2026-04 unverdicted novelty 5.0

    TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.

  21. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  22. Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

    cs.AI 2026-04 unverdicted novelty 5.0

    Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.

  23. VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

    cs.CL 2026-05 unverdicted novelty 4.0

    VectraYX-Nano is a 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and native MCP tool use, achieving 0.78 conversational gate and improved tool selection with denser data.

  24. Agentic Performance at the Edge: Insights from Benchmarking

    cs.AI 2026-05 unverdicted novelty 4.0

    Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.

  25. OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

    cs.CR 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.

  26. DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs

    cs.CR 2026-04 unverdicted novelty 4.0

    DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.

  27. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  28. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

  29. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP 2026-04 unverdicted novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 28 Pith papers · 13 internal anchors

  1. [1]

    Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP

  2. [2]

    Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and Wu, Y. (2023). PaLM 2 technical report

  3. [3]

    Biderman, S., and Welleck, S. (2023). Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631

  4. [4]

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y., Yu, B., Yuan, H., Y...

  5. [5]

    Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954

  6. [6]

    Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of ICML

  7. [7]

    Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. (2020). PIQA: Reasoning about physical commonsense in natural language. In Proceedings of AAAI

  8. [8]

    Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Proceedings of NeurIPS

  9. [9]

    Chia, Y. K., Hong, P., Bing, L., and Poria, S. (2023). INSTRUCTEVAL: towards holistic evaluation of instruction-tuned large language models. CoRR, abs/2306.04757

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  11. [11]

    Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL

  12. [12]

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457

  13. [13]

    Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. (2018). XNLI: Evaluating cross-lingual sentence representations. In Riloff, E., Chiang, D.,

  14. [14]

    Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691

  15. [15]

    Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of ICML

  16. [16]

    Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL

  17. [17]

    Emelin, D. and Sennrich, R. (2021). Wino-X: Multilingual Winograd schemas for commonsense reasoning and coreference resolution. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Proceedings of EMNLP, pages 8517–8532, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  18. [18]

    Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2023). A framework for few-shot language model evaluation

  19. [19]

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021). Measuring massive multitask language understanding. In Proceedings of ICLR

  20. [20]

    Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L. (2022). Training compute-optimal large language models. In Proceedings of NeurIPS

  21. [21]

    Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., et al. (2024). MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395

  22. [22]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  23. [23]

    Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., and Haziza, D. (2022). xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers

  24. [24]

    Hughes, S., Wolf, T., Guha, A., Werra, L. V., and de Vries, H. (2023). StarCoder: may the source be with you! Transactions on Machine Learning Research. Lightning-AI (2023). Lit-GPT

  25. [25]

    Du, J., Pasunuru, R., Shleifer, S., Koura, P. S., Chaudhary, V., O’Horo, B., Wang, J., Zettlemoyer, L., Kozareva, Z., Diab, M., Stoyanov, V., and Li, X. (2022). Few-shot learning with multilingual generative language models. In Goldberg, Y., Kozareva, Z., and Zhang, Y., editors, Proceedings of EMNLP, pages 9019–9052, Abu Dhabi, United Arab Emirates. As...

  26. [26]

    Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of ICLR

  27. [27]

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of EMNLP. OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  28. [28]

    Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., and Korhonen, A. (2020). XCOPA: A multilingual dataset for causal commonsense reasoning. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of EMNLP, pages 2362–2376, Online. Association for Computational Linguistics

  29. [29]

    Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. (2021). WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106

  30. [30]

    Shazeer, N. (2020). GLU variants improve transformer. CoRR, abs/2002.05202

  31. [31]

    Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

  32. [32]

    Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615

  33. [33]

    Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864

  34. [34]

    Chi, E., Zhou, D., and Wei, J. (2023). Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of ACL. Thaddée, Y. T. (2023). Chinchilla’s death. https://espadrine.github.io/blog/posts/chinchilla-s-death.html. Together Computer (2023). Redpajama: an open dataset for training large language models

  35. [35]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  36. [36]

    Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  37. [37]

    Polosukhin, I. (2017). Attention is all you need. In Proceedings of NeurIPS

  38. [38]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS

  39. [39]

    Wei, T., Zhao, L., Zhang, L., Zhu, B., Wang, L., Yang, H., Li, B., Cheng, C., Lü, W., Hu, R., et al. (2023). Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341

  40. [40]

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? In Proceedings of the ACL

  41. [41]

    Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. In Proceedings of NeurIPS

  42. [42]

    OPT: Open Pre-trained Transformer Language Models

    Lin, X. V., et al. (2022). OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  43. [43]

    Su, T., Yang, Z., and Tang, J. (2023). CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 5673–5684. ACM