pith. machine review for the scientific record.

arxiv: 2401.02385 · v2 · submitted 2024-01-04 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

TinyLlama: An Open-Source Small Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:05 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI
keywords: small language model · pretraining · Llama 2 · 1.1B parameters · open-source LLM · downstream tasks · FlashAttention

The pith

A 1.1 billion parameter model pretrained on one trillion tokens outperforms other open-source models of similar size on downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TinyLlama as a 1.1B parameter language model pretrained on roughly 1 trillion tokens for about three epochs. It adopts the Llama 2 architecture and tokenizer while adding open-source efficiency improvements such as FlashAttention. The central result is that this setup yields stronger results on standard downstream benchmarks than other open-source models in the same size range. A reader would care because the finding points to data volume and training choices as levers that can make smaller models competitive without needing larger parameter counts.
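A back-of-the-envelope check makes the data-volume lever concrete. The sketch below compares the reported token budget against the roughly 20-tokens-per-parameter compute-optimal heuristic from the scaling-law literature; the heuristic is an editorial assumption, not a figure from the paper.

```python
# Hedged arithmetic: how far TinyLlama's token budget exceeds a Chinchilla-style
# compute-optimal budget. The ~20 tokens/parameter rule of thumb is an assumption
# from the scaling-law literature, not a number reported in the paper.
params = 1.1e9    # parameters, as reported
tokens = 1.0e12   # pretraining tokens, as reported (~3 epochs)

heuristic_budget = 20 * params  # compute-optimal token count under the heuristic
print(f"heuristic budget: {heuristic_budget / 1e9:.0f}B tokens")
print(f"actual budget:    {tokens / 1e12:.1f}T tokens "
      f"(~{tokens / heuristic_budget:.0f}x the heuristic)")
```

On that heuristic, 1 trillion tokens is roughly 45 times the compute-optimal budget for a 1.1B model, which is the sense in which data volume rather than parameter count carries the result.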

Core claim

TinyLlama is a 1.1B parameter language model pretrained on approximately 1 trillion tokens using the Llama 2 architecture and tokenizer together with community optimizations including FlashAttention and Lit-GPT. The model achieves better performance across a series of downstream tasks than existing open-source language models of comparable size.

What carries the argument

TinyLlama, the 1.1B parameter model that combines the Llama 2 architecture with pretraining on 1 trillion tokens and open-source efficiency tools.

If this is right

  • Model size can be reduced while maintaining competitive task performance when pretraining data reaches trillions of tokens.
  • Open-source efficiency libraries enable practical training runs at the 1B scale.
  • Public release of checkpoints supports further fine-tuning and analysis by the community.
  • Downstream results improve measurably with increased pretraining tokens even when parameter count stays fixed at 1.1 billion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that data scaling may substitute for parameter scaling in some performance regimes.
  • Hardware-constrained settings could benefit from prioritizing data collection over model enlargement.
  • Similar recipes may transfer to other modalities where large corpora exist but compute budgets are limited.

Load-bearing premise

The performance edge comes from the scale of pretraining data and chosen optimizations rather than from evaluation setup or unintended overlap between training and test data.
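A minimal sketch of the kind of check that would probe this premise: flag benchmark test items whose long n-grams also appear in the pretraining text. The 13-gram window, the 0.5 overlap threshold, and the function names are illustrative assumptions, not tooling described in the paper.

```python
# Illustrative n-gram overlap check between pretraining text and benchmark items.
# Names, window size, and threshold are hypothetical, not from the paper.
def ngrams(text: str, n: int = 13) -> set[str]:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(test_items, corpus_docs, n: int = 13, threshold: float = 0.5):
    corpus_grams: set[str] = set()
    for doc in corpus_docs:  # in practice this would stream over corpus shards
        corpus_grams |= ngrams(doc, n)
    flagged = []
    for item in test_items:
        grams = ngrams(item, n)
        if not grams:
            continue
        overlap = len(grams & corpus_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((item, overlap))
    return flagged
```

Until a report of this kind exists, the premise that the edge comes from data scale rather than overlap remains an assumption.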

What would settle it

A head-to-head evaluation on the same downstream tasks in which a 1.1B model trained on substantially less data or without the listed optimizations matches or exceeds TinyLlama's scores.

original abstract

We present TinyLlama, a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents TinyLlama, a 1.1B-parameter language model pretrained on approximately 1 trillion tokens for roughly 3 epochs. It adopts the Llama 2 architecture and tokenizer, incorporates open-source optimizations such as FlashAttention and Lit-GPT for computational efficiency, and reports that the model significantly outperforms other open-source language models of comparable size on downstream tasks. Model checkpoints and training code are released publicly on GitHub.

Significance. If the reported performance margins are robust, the work is significant as a reproducible, openly available small-scale model that demonstrates the viability of achieving competitive downstream results through large-scale pretraining (1T tokens) combined with community-driven efficiency improvements. The public release of code and weights directly supports reproducibility and further research on efficient LLMs.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance (a minimal variance sketch follows these comments).
  2. [§3 (Training) and §4] The manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.
minor comments (2)
  1. [Results tables] In Table 1 or the equivalent results table, include the exact number of tokens seen per model and the precise evaluation harness (e.g., prompting format, few-shot settings) used for every baseline to enable direct replication.
  2. [§2 (Related Work)] Add a brief comparison of training FLOPs or wall-clock time against the strongest baselines to quantify the claimed computational-efficiency gains from FlashAttention and Lit-GPT.
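Major comment 1 asks for some notion of evaluation variance. A crude but cheap substitute, sketched below, attaches a binomial standard error to a single-run benchmark accuracy; the numbers are placeholders, not scores from the paper.

```python
import math

def accuracy_se(acc: float, n_items: int) -> float:
    """Standard error of an accuracy treated as a binomial proportion."""
    return math.sqrt(acc * (1.0 - acc) / n_items)

# Hypothetical score on a 10,000-item benchmark, not a result from the paper.
acc, n_items = 0.59, 10_000
half_width = 1.96 * accuracy_se(acc, n_items)
print(f"acc = {acc:.2f} ± {half_width:.3f} (95% confidence half-width)")
# Margins over a baseline comparable to this half-width are hard to distinguish
# from evaluation noise on a single run.
```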

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and have revised the manuscript accordingly where feasible.

point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The central claim of significant outperformance over comparable open-source models lacks reported error bars, standard deviations across multiple runs, or statistical significance tests on the benchmark scores. Without these, it is impossible to determine whether the observed margins reflect genuine improvements or evaluation variance.

    Authors: We agree that error bars and statistical tests would strengthen the evaluation. However, pretraining a 1.1B model on 1T tokens required substantial compute, and we performed only a single training run. In the revised manuscript we have added an explicit statement in Section 4 noting that all reported results are from this single run and discussing the implications for variance. We also reference prior work on LLM evaluation stability. While we cannot rerun the full pretraining to obtain multiple seeds, the consistent gains across diverse benchmarks support the robustness of the findings. revision: partial

  2. Referee: [§3 (Training) and §4] The manuscript does not describe explicit data decontamination or contamination checks between the 1T-token pretraining corpus and the downstream evaluation benchmarks. This is load-bearing for the claim that gains arise from architecture, token count, and optimizations rather than test-set overlap.

    Authors: We acknowledge the importance of this point. The original submission did not detail explicit decontamination checks. The pretraining corpus is SlimPajama, which applies its own deduplication and filtering. In the revised manuscript we have expanded Section 3 with additional description of the data sources and preprocessing pipeline, and added a paragraph in Section 4 explicitly noting the absence of benchmark-specific contamination analysis as a limitation. We believe the scale of training and the nature of the cleaned dataset make substantial leakage unlikely, but we agree this should be stated clearly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training results on external benchmarks

full rationale

The paper reports an empirical outcome: pretraining a 1.1B model on ~1T tokens using the Llama 2 architecture plus open-source optimizations (FlashAttention, Lit-GPT), then measuring downstream task performance against existing open-source baselines. No equations, derivations, fitted parameters, or predictions appear; the central claim is a direct train-and-evaluate result. The prior work it leans on is limited to the external Llama 2 paper and standard tools; none of it functions as a load-bearing uniqueness theorem or ansatz that would reduce the result to the authors' own prior inputs. The work therefore stands against external benchmarks, with no reduction by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer scaling assumptions and the effectiveness of Llama 2's architecture; no new entities are postulated.

free parameters (3)
  • model parameter count
    1.1B chosen as target size for the compact model
  • pretraining token count
    1 trillion tokens selected as training volume
  • training epochs
    Approximately 3 epochs over the data
axioms (2)
  • domain assumption The Llama 2 architecture and tokenizer are a suitable base for small-scale pretraining
    Paper builds directly on them without re-derivation
  • domain assumption Open-source efficiency tools (FlashAttention, Lit-GPT) preserve model quality while improving speed
    Invoked to justify computational choices (see the sketch below)
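The second axiom can be checked directly: a fused FlashAttention-style kernel computes the same softmax(QK^T / sqrt(d)) V as the naive formulation, so swapping it in should change speed and memory use, not outputs. The sketch below uses PyTorch's scaled_dot_product_attention as a stand-in for the fused kernel; the tensor shapes are arbitrary, and nothing here comes from the paper's training code.

```python
import torch
import torch.nn.functional as F

# Random tensors shaped (batch, heads, seq_len, head_dim); values are arbitrary.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

# Naive attention: softmax(Q K^T / sqrt(d)) V
naive = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v

# Fused path; on supported GPUs this can dispatch to a FlashAttention-style kernel.
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-5))  # True: same math, different kernel
```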

pith-pipeline@v0.9.0 · 5397 in / 1202 out tokens · 40023 ms · 2026-05-13T21:05:08.954699+00:00 · methodology

discussion (0)


Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  3. When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

    cs.CR 2026-05 conditional novelty 7.0

    Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.

  4. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  5. BoostTaxo: Zero-Shot Taxonomy Induction via Boosting-Style Agentic Reasoning and Constraint-Aware Calibration

    cs.CL 2026-04 unverdicted novelty 7.0

    BoostTaxo introduces a boosting-style LLM framework for zero-shot taxonomy induction that uses hybrid candidate selection and constraint-aware calibration to achieve superior or comparable performance to prior methods...

  6. SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

    cs.SD 2026-05 unverdicted novelty 6.0

    SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.

  7. PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.

  8. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  9. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  10. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.

  11. Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

    cs.RO 2026-05 unverdicted novelty 6.0

    Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

  12. Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.

  13. StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.

  14. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

    cs.AR 2026-04 unverdicted novelty 6.0

    A CIM-based hardware-software co-design in 65nm achieves up to 7.3x higher throughput and 49.59x better energy efficiency than NVIDIA Orin Nano for LLaMA3.2-1B, averaging 336 tokens/s and 173 tokens/J under INT4 acros...

  15. A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring

    cs.RO 2026-04 unverdicted novelty 6.0

    A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user ...

  16. RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems

    cs.CR 2026-04 conditional novelty 6.0

    RAGShield detects all numerical manipulations in government RAG systems via pattern-based value extraction and cross-source verification, achieving 0% attack success rate on 430 real IRS-derived attacks where embeddin...

  17. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    cs.CL 2024-04 conditional novelty 6.0

    MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

  18. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

    cs.LG 2024-01 unverdicted novelty 6.0

    EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.

  19. DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.

  20. TabEmb: Joint Semantic-Structure Embedding for Table Annotation

    cs.LG 2026-04 unverdicted novelty 5.0

    TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.

  21. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  22. Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

    cs.AI 2026-04 unverdicted novelty 5.0

    Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.

  23. VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

    cs.CL 2026-05 unverdicted novelty 4.0

    VectraYX-Nano is a 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and native MCP tool use, achieving 0.78 conversational gate and improved tool selection with denser data.

  24. Agentic Performance at the Edge: Insights from Benchmarking

    cs.AI 2026-05 unverdicted novelty 4.0

    Edge agentic AI quality is not a simple function of model size; robust results require joint design of model selection and tool integration, as revealed by domain-conditioned benchmarks showing accuracy-latency Pareto fronts.

  25. OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

    cs.CR 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.

  26. DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs

    cs.CR 2026-04 unverdicted novelty 4.0

    DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.

  27. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  28. Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities

    cs.DC 2026-04 unverdicted novelty 3.0

    A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.

  29. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP 2026-04 unverdicted novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 28 Pith papers · 13 internal anchors

  1. [1]

    Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP

  2. [2]

    Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and Wu, Y. (2023). PaLM 2 technical report

  3. [3]

    Biderman, S., and Welleck, S. (2023). Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631

  4. [4]

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y., Yu, B., Yuan, H., Y...

  5. [5]

    Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954

  6. [6]

    Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of ICML

  7. [7]

    Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. (2020). PIQA: Reasoning about physical commonsense in natural language. In Proceedings of AAAI

  8. [8]

    Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Proceedings of NeurIPS

  9. [9]

    Chia, Y. K., Hong, P., Bing, L., and Poria, S. (2023). INSTRUCTEVAL: towards holistic evaluation of instruction-tuned large language models. CoRR, abs/2306.04757

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  11. [11]

    Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL

  12. [12]

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457

  13. [13]

    Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. (2018). XNLI: Evaluating cross-lingual sentence representations. In Riloff, E., Chiang, D.,

  14. [14]

    Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691

  15. [15]

    Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of ICML

  16. [16]

    Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. (2019). DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL

  17. [17]

    Emelin, D. and Sennrich, R. (2021). Wino-X: Multilingual Winograd schemas for commonsense reasoning and coreference resolution. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t., editors, Proceedings of EMNLP, pages 8517–8532, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics

  18. [18]

    Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2023). A framework for few-shot language model evaluation

  19. [19]

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021). Measuring massive multitask language understanding. In Proceedings of ICLR

  20. [20]

    Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Vinyals, O., Rae, J. W., and Sifre, L. (2022). Training compute-optimal large language models. In Proceedings of NeurIPS

  21. [21]

    Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., et al. (2024). MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395

  22. [22]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  23. [23]

    Lefaudeux, B., Massa, F., Liskovich, D., Xiong, W., Caggiano, V., Naren, S., Xu, M., Hu, J., Tintore, M., Zhang, S., Labatut, P., and Haziza, D. (2022). xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers

  24. [24]

    Hughes, S., Wolf, T., Guha, A., Werra, L. V., and de Vries, H. (2023). StarCoder: may the source be with you! Transactions on Machine Learning Research. Lightning-AI (2023). Lit-GPT

  25. [25]

    Du, J., Pasunuru, R., Shleifer, S., Koura, P. S., Chaudhary, V., O’Horo, B., Wang, J., Zettlemoyer, L., Kozareva, Z., Diab, M., Stoyanov, V., and Li, X. (2022). Few-shot learning with multilingual generative language models. In Goldberg, Y., Kozareva, Z., and Zhang, Y., editors, Proceedings of EMNLP, pages 9019–9052, Abu Dhabi, United Arab Emirates. As...

  26. [26]

    Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of ICLR

  27. [27]

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018). Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of EMNLP. OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  28. [28]

    Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., and Korhonen, A. (2020). XCOPA: A multilingual dataset for causal commonsense reasoning. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of EMNLP, pages 2362–2376, Online. Association for Computational Linguistics

  29. [29]

    Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. (2021). WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106

  30. [30]

    Shazeer, N. (2020). GLU variants improve transformer. CoRR, abs/2002.05202

  31. [31]

    Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

  32. [32]

    Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615

  33. [33]

    Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864

  34. [34]

    Chi, E., Zhou, D., and Wei, J. (2023). Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of ACL. Thaddée, Y. T. (2023). Chinchilla’s death. https://espadrine.github.io/blog/posts/chinchilla-s-death.html. Together Computer (2023). Redpajama: an open dataset for training large language models

  35. [35]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  36. [36]

    Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  37. [37]

    Polosukhin, I. (2017). Attention is all you need. In Proceedings of NeurIPS

  38. [38]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS

  39. [39]

    Wei, T., Zhao, L., Zhang, L., Zhu, B., Wang, L., Yang, H., Li, B., Cheng, C., Lü, W., Hu, R., et al. (2023). Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341

  40. [40]

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? In Proceedings of the ACL

  41. [41]

    Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. In Proceedings of NeurIPS

  42. [42]

    OPT: Open Pre-trained Transformer Language Models

    Lin, X. V., et al. (2022). OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

  43. [43]

    Su, T., Yang, Z., and Tang, J. (2023). CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 5673–5684. ACM